Abstract. We extend the Longstaff-Schwartz algorithm for approximately solving optimal
stopping problems on high-dimensional state spaces. We reformulate the optimal stopping 
problem for Markov processes in discrete time as a generalized statistical learning problem. 
Within this setup we apply deviation inequalities for suprema of empirical processes to derive
consistency criteria, and to estimate the convergence rate and sample complexity. Our results 
strengthen and extend earlier results obtained by Clement, Lamberton and Protter (2002). 



1. Introduction 

The problem of arbitrage-free pricing of American options has renewed the interest in efficient
methods for numerically solving high-dimensional optimal stopping problems. In this paper we 
explain how to solve a discrete-time, finite-horizon optimal stopping problem by restating it as a 
generalized statistical learning problem. We give a unified treatment of the Longstaff-Schwartz 
and the Tsitsiklis-Van Roy algorithm. Both use Monte Carlo simulation and linearly parameterized
approximation spaces. We introduce a new class of algorithms which interpolate between
the Longstaff-Schwartz and the Tsitsiklis-Van Roy algorithm and relax the linearity assumption on
the approximation spaces.

Learning an optimal stopping rule differs from the standard setup in statistical and machine 
learning in the sense that it requires a series of learning tasks, one for every time step, starting 
at the terminal horizon and proceeding backward. The individual learning tasks are connected 
by the dynamic programming principle. At each time step, the result depends on the outcome 
of the previous learning tasks. Chaining the learning tasks into a recursive sequence
of learning problems leads to error propagation. We control the error propagation by using
a Lipschitz property and a suitable error decomposition which relies on the convexity of the 
approximation spaces. Finally, we estimate the sample error with exponential tail bounds for 
the supremum of empirical processes. To apply these techniques we need to calculate the covering
numbers of certain function classes. An important type of function class for which good
estimates of the covering numbers exist is the so-called Vapnik-Chervonenkis (VC) class,
see Van der Vaart and Wellner (1996) or Anthony and Bartlett (1999). We prove that the class of
payoff functions evaluated at Markov stopping times parameterized by a VC-class of functions is again
a VC-class. The covering number estimate of Haussler (1995) then gives the required bounds.
Our approach is conceptually different from Clement et al. (2002), which is purely tailored to 
the classical Longstaff-Schwartz algorithm with linear approximation. By exploiting convexity 
and fundamental properties of VC-classes we can prove convergence and derive error estimates 
under less restrictive conditions, even when both the dimension of the approximating spaces and the
number of samples tend to infinity.

This paper is structured as follows. The next background section discusses recent developments 
in numerical techniques for optimal stopping problems and summarizes the probabilistic tools 
which we use in this work. Section 3 reviews discrete-time optimal stopping problems. Section 
4 shows how to restate optimal stopping as a statistical learning problem and introduces the 
dynamic look-ahead algorithm. In Section 5 we state and comment on our main results: a general
consistency result for convergence, estimates of the overall error, the convergence rate, and the 
sample complexity. The focus of the work lies in estimating the sample error. The proofs are 
deferred to Section 6 where we also introduce the necessary tools of the Vapnik-Chervonenkis 
theory. 

2. Background 

Optimal stopping problems naturally arise in the context of games where a player wants to 
determine when to stop playing a sequence of games to maximize his expected fortune. The first 
systematic theory of optimal stopping emerged with Wald and Wolfowitz (1948) on the sequential 
probability ratio test. The monographs by Chow, Robbins and Siegmund (1971) and Shiryayev 
(1978) provide an extensive treatment of optimal stopping theory.

The general no-arbitrage valuation of American options in terms of an optimal stopping prob- 
lem begins with Bensoussan (1984) and Karatzas (1988). Nowadays, American option valuation 
is an important application of optimal stopping theory. For more background on American op- 
tions and financial aspects of the related optimal stopping problem we refer to Karatzas and 
Shreve (1998). 

2.1. Algorithms for Solving Optimal Stopping Problems. Optimal stopping problems 
generally cannot be solved in closed form. Therefore, several numerical techniques have been 
developed. Barone-Adesi and Whaley (1987) propose a semi-analytical approximation. The 
binomial tree algorithm of Cox, Ross and Rubinstein (1979) directly implements the dynamic 
programming principle. Other approaches comprise Markov chain approximations, see Kushner 
(1997), direct integral equation and PDE methods. The PDE methods are based on variational
inequalities, developed in Bensoussan and Lions (1982) or Jaillet, Lamberton and Lapeyre (1990),
the linear complementarity problem, see Huang and Pang (1998), or the free boundary value
problem, see Van Moerbeke (1976). However, the viability of all of these methods is limited
by the curse of dimensionality: for these algorithms the computing cost and storage needs grow
exponentially with the dimension of the underlying state space.

To address this limitation, new Monte Carlo algorithms have been proposed. The first landmark
papers in this direction are Bossaerts (1989), Tilley (1993), and Broadie and Glasserman
(1997). Longstaff and Schwartz (2001) introduce a new algorithm for Bermudan options in 
discrete time. It combines Monte Carlo simulation with multivariate function approximation. 
They show how to solve the optimal stopping problem algorithmically by a nested sequence of 
least-squares regression problems and briefly outline a convergence proof. Tsitsiklis and Van Roy
(1999) independently propose an alternative parametric approximation algorithm on the basis of 
temporal-difference learning. Their approach relies on stochastic approximation of fixed points 
of contraction maps. They prove almost sure convergence by using stochastic approximation 
techniques as developed in Kushner and Clark (1978), Benveniste, Métivier and Priouret (1990),
or Kushner and Yin (1997). Both the Longstaff-Schwartz and the Tsitsiklis-Van Roy algorithms
approximate the value function or the early exercise rule and therefore provide a lower bound for 
the true optimal stopping value. Rogers (2002) proposes a method based on the dual problem 
which results in upper bounds. The overview paper Broadie and Glasserman (1998) describes
the state of development of Monte Carlo algorithms for optimal stopping as of 1998. A more
recent reference is the book of Glasserman (2004). A comparative study of various Monte Carlo
algorithms for optimal stopping can be found in Laprise, Su, Wu, Fu and Madan (2001). 

Despite the contributions of Tsitsiklis and Van Roy (1999), Longstaff and Schwartz (2001),
and Rogers (2002), many questions about Monte Carlo algorithms for optimal stopping, such as
convergence and error estimates, remain open. Clement et al. (2002) provide a complete
convergence proof and a Central Limit Theorem for the Longstaff-Schwartz algorithm. But there
are so far no results on more general, possibly nonlinear approximation schemes, on the rate of
convergence, or on error estimates. These problems are the main topics addressed in this paper.

2.2. Probabilistic Tools. The main probabilistic tools which we apply in this paper are
exponential deviation inequalities for suprema of empirical processes. These tail bounds have
been developed by Vapnik and Chervonenkis (1971), Pollard (1984), Talagrand (1994), Ledoux
(1996), Massart (2000), Rio (2001), and many others. Compared to Central Limit Theorems,
they are non-asymptotic and provide meaningful results already for a finite sample size.
Deviation inequalities together with combinatorial estimates of covering numbers in terms of 
the Vapnik-Chervonenkis dimension are the cornerstones of statistical learning by empirical risk 
minimization. For additional details on statistical learning theory we refer to Vapnik (1982), 
Vidyasagar (2003), Anthony and Bartlett (1999), Vapnik (2000), Cucker and Smale (2001), 
Mendelson (2003a), Mendelson (2003b), and Gyorfi, Kohler, Krzyzak and Walk (2002). 

2.3. Basic Notations. The following terminology and notation will be used throughout this 
paper. If $\mu$ is a measure on a measurable space $(M, \mathcal{A})$ we denote by $L_p(M, \mu)$ the usual
$L_p$-spaces endowed with the norm $\|\cdot\|_p$. If we need to indicate the measure space we write
$\|\cdot\|_{p,M,\mu}$. Let $d_{p,\mu}$ be the induced metric $d_{p,\mu}(f, g) = \|f - g\|_{p,\mu}$.

Let $(M, d)$ be a metric space. If $U \subset M$ is an arbitrary subset we define the covering number

\[ N(\varepsilon, U, d) = \inf\{ n \in \mathbb{N} \mid \exists\, \{x_1, \ldots, x_n\} \subset M \text{ such that } \forall x \in U:\ \min_{i=1,\ldots,n} d(x, x_i) \le \varepsilon \}, \tag{2.1} \]

which is the minimum number of closed balls of radius $\varepsilon$ required to cover $U$. The logarithm of
the covering number is called the entropy. The growth rate of the entropy for $\varepsilon \to 0$ is a measure
of the compactness of the metric space $U$.
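For a finite subset of the real line, an upper bound on the covering number (2.1) can be computed greedily. The following Python sketch is only an illustration: the function name `greedy_cover_size` is hypothetical, and the ball centers are restricted to points of $U$ itself (an assumption made here), so the result bounds $N(\varepsilon, U, d)$ from above rather than computing it exactly.

```python
def greedy_cover_size(points, eps, dist=lambda a, b: abs(a - b)):
    """Upper bound on the covering number N(eps, U, d) of a finite set U.

    Centers are picked greedily from U itself, so the returned count can
    exceed the true minimum, which allows centers anywhere in M.
    """
    uncovered = list(points)
    centers = 0
    while uncovered:
        center = uncovered[0]  # next center: first point not yet covered
        # keep only the points outside the closed ball of radius eps
        uncovered = [x for x in uncovered if dist(x, center) > eps]
        centers += 1
    return centers
```

For the points 0, 0.1, 0.2 with radius 0.15 the greedy rule uses two balls, while a single ball centered at 0.1 would suffice. This gap is the price of restricting the centers to $U$.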

Let $X, X_1, X_2, \ldots$ be i.i.d. random elements on a measurable space $(M, \mathcal{A})$ with distribution
$P$. The empirical measure of a random sample $X_1, \ldots, X_n$ is the discrete random measure given by

\[ P_n(A) = \frac{1}{n} \sum_{i=1}^{n} 1_{\{X_i \in A\}}, \quad A \in \mathcal{A}, \tag{2.2} \]

or, if $g$ is a function on $M$,

\[ P_n g = \frac{1}{n} \sum_{i=1}^{n} g(X_i). \tag{2.3} \]

The empirical measure is a random measure supported on $(M^\infty, P^\infty, \mathcal{A}^\infty)$ where $M^\infty = \prod_{\mathbb{N}} M$
is the product space of countably many copies of $M$, $P^\infty$ the product measure, and $\mathcal{A}^\infty$ the
product $\sigma$-algebra. The random variables $X_i$ can now be identified with the $i$-th coordinate
projections.
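As a small illustration of (2.2) and (2.3), the following Python sketch evaluates the empirical measure of a set and the empirical mean of a function on a given sample. The function names are hypothetical, and the set $A$ is represented by a membership predicate, which is an assumption of this sketch.

```python
def empirical_measure(sample, event):
    """P_n(A): fraction of sample points X_i that fall in A = {x : event(x)}."""
    return sum(1 for x in sample if event(x)) / len(sample)


def empirical_mean(sample, g):
    """P_n g: average of g(X_i) over the sample."""
    return sum(g(x) for x in sample) / len(sample)
```

Both quantities are random because they depend on the drawn sample. The deviation inequalities of Section 2.2 control how far they can be from $P(A)$ and $Pg$ uniformly over a class of sets or functions.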
3. Review of Discrete Time Optimal Stopping

Let $X = (X_t)_{t=0,\ldots,T}$ be a discrete time $\mathbb{R}^m$-valued Markov process. We assume $X$ is canonically
defined on the path space $\mathcal{X} = \mathbb{R}^m \times \cdots \times \mathbb{R}^m$ of $T + 1$ factors and identify $X_t$ with the
projection onto the factor $t$. We endow $\mathcal{X}$ with the Borel $\sigma$-algebra $\mathcal{B}$. Let $\mathcal{F}_t$ be the smallest
$\sigma$-algebra generated by $\{X_s \mid s \le t\}$ and $\mathbb{F} = (\mathcal{F}_t)_{t=0,\ldots,T}$ the corresponding filtration.

Let $P$ be the law of $X$ on $\mathcal{X}$ and $\mu_t = P_{X_t}$ the law of $X_t$ on $\mathbb{R}^m$. We introduce the spaces of
Markov $L_p$-functions

\[ L_p(\mathcal{X}) = \{ h = (h_0, \ldots, h_T) \mid h_t \in L_p(\mathbb{R}^m, \mu_t),\ \forall t = 0, \ldots, T \}, \tag{3.1} \]

with norm

\[ \|h\|_p = \sum_{t=0}^{T} \|h_t\|_p = \sum_{t=0}^{T} E[|h_t(X_t)|^p]^{1/p}. \tag{3.2} \]

For brevity we drop the measures $P$, $\mu_t$ and the coordinate projections $X_t$ in our notation
whenever no confusion is possible. Also, if $h \in L_p(\mathcal{X})$ and $x = (x_0, \ldots, x_T) \in \mathcal{X}$ is a point of the
path space, we introduce the shorthand notation

\[ h_t(x) = h_t(x_t). \tag{3.3} \]

3.1. Discrete Time Optimal Stopping. In the following $f \in L_1(\mathcal{X})$ is a nonnegative reward
or payoff function. The optimal stopping problem consists of finding the value process

\[ V_t = \operatorname{ess\,sup}_{\tau \in \mathcal{T}(t,\ldots,T)} E[f_\tau(X_\tau) \mid \mathcal{F}_t], \tag{3.4} \]

where the supremum is taken over the family $\mathcal{T}(t, \ldots, T)$ of all $\mathbb{F}$-stopping times with values in
$t, \ldots, T$. Adding a positive constant $\varepsilon$ to the payoff $f$ just increases $V_t$ by $\varepsilon$. We therefore can
assume without loss of generality that $f \in L_1(\mathcal{X})$ is a positive payoff function. A stopping rule
$\tau_t^* \in \mathcal{T}(t, \ldots, T)$ is optimal for time $t$ if it attains the optimal value

\[ V_t = E[f_{\tau_t^*}(X_{\tau_t^*}) \mid \mathcal{F}_t]. \tag{3.5} \]

Once the value process is known, an optimal stopping rule at time $t$ is given by

\[ \tau_t^* = \inf\{ s \ge t \mid V_s \le f_s(X_s) \}. \tag{3.6} \]

To exploit the Markov property of the underlying process $X_t$ we introduce the value function

\[ v_t(x) = \sup_{\tau \in \mathcal{T}(t,\ldots,T)} E[f_\tau(X_\tau) \mid X_t = x]. \tag{3.7} \]

The Markov property implies $V_t = v_t(X_t)$. Closely related to the value process $V_t$ is the process

\[ Q_t = \operatorname{ess\,sup}_{\tau \in \mathcal{T}(t+1,\ldots,T)} E[f_\tau(X_\tau) \mid \mathcal{F}_t] = E[f_{\tau_{t+1}^*}(X_{\tau_{t+1}^*}) \mid \mathcal{F}_t], \tag{3.8} \]

which is defined for all $t = 0, \ldots, T - 1$. Again, by the Markov property, we get the representation
$Q_t = q_t(X_t)$ where

\[ q_t(x) = \sup_{\tau \in \mathcal{T}(t+1,\ldots,T)} E[f_\tau(X_\tau) \mid X_t = x] = E[f_{\tau_{t+1}^*}(X_{\tau_{t+1}^*}) \mid X_t = x]. \tag{3.9} \]

We extend the definition of $q_t$ up to the horizon $T$ and set $q_T = f_T$. The function $q_t$ is referred
to as the continuation value. It represents the optimal value at time $t$, subject to the constraint
of not stopping at $t$. The value function and the continuation value are related by

\[ v_t(X_t) = \max(f_t(X_t), q_t(X_t)), \quad q_t(X_t) = E[v_{t+1}(X_{t+1}) \mid X_t]. \tag{3.10} \]

The dynamic programming principle implies a recursive expression for the value, the continuation
value, and the optimal stopping times. The recursion starts at the horizon $T$ with $v_T(X_T) =
q_T(X_T) = f_T(X_T)$ and proceeds backward for $t = T - 1, \ldots, 0$ according to

\[ v_t(X_t) = \max(f_t(X_t), E[v_{t+1}(X_{t+1}) \mid X_t]), \tag{3.11} \]

respectively

\[ q_t(X_t) = E[\max(f_{t+1}(X_{t+1}), q_{t+1}(X_{t+1})) \mid X_t]. \tag{3.12} \]

Similarly, the recursion for the optimal stopping rules $\tau_t^*$ starts at the horizon $T$ with $\tau_T^* = T$.
Given $v_t$ respectively $q_t$ and the optimal stopping rule $\tau_{t+1}^*$ at time $t + 1$, the optimal stopping
rule $\tau_t^*$ is determined by

\[ \tau_t^* = t\, 1_{\{v_t(X_t) = f_t(X_t)\}} + \tau_{t+1}^*\, 1_{\{v_t(X_t) > f_t(X_t)\}} = t\, 1_{\{q_t(X_t) \le f_t(X_t)\}} + \tau_{t+1}^*\, 1_{\{q_t(X_t) > f_t(X_t)\}}. \tag{3.13} \]

From a theoretical point of view, the value function $v_t$ and the continuation value $q_t$ are equivalent
since they both provide a solution to the optimal stopping problem. However, from an algorithmic
point of view, the continuation value is preferred. Indeed, $q_t$ tends to be smoother than $v_t$ because
the max operation introduces a kink in the value function. We note that in continuous time this
kink disappears, since by the smooth fit principle, the value function connects $C^1$-smoothly to
the payoff function along the optimal stopping boundary.
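When the state space is finite and the transition probabilities are known, the recursion for $q_t$ and $v_t$ can be evaluated exactly. The following Python sketch assumes a time-homogeneous chain given by a transition matrix `P` and a table `payoff[t][x]`; these names and the finite-state setting are assumptions made for illustration, not part of the setup above.

```python
def backward_induction(P, payoff, T):
    """Value functions v_t and continuation values q_t on a finite state space.

    P[x][y] is the transition probability from state x to state y and
    payoff[t][x] is f_t(x). Uses the convention q_T = f_T.
    """
    n_states = len(P)
    q = [None] * (T + 1)
    v = [None] * (T + 1)
    q[T] = list(payoff[T])
    v[T] = list(payoff[T])
    for t in range(T - 1, -1, -1):
        # q_t(x) = E[max(f_{t+1}(X_{t+1}), q_{t+1}(X_{t+1})) | X_t = x]
        q[t] = [sum(P[x][y] * max(payoff[t + 1][y], q[t + 1][y])
                    for y in range(n_states))
                for x in range(n_states)]
        # v_t(x) = max(f_t(x), q_t(x))
        v[t] = [max(payoff[t][x], q[t][x]) for x in range(n_states)]
    return v, q
```

The cost of each step is quadratic in the number of states, which is exactly what becomes infeasible when the state space is a fine grid in many dimensions.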

Expression (3.13) for the optimal stopping rule suggests that we consider stopping rules
parameterized by functions $h \in L_1(\mathcal{X})$ with $h_T = f_T$. The terminal condition $h_T = f_T$ reflects the
terminal boundary condition $\tau_T^* = T$. Let

\[ \theta_{f,t}(h) = \theta(f_t - h_t), \quad \theta^-_{f,t}(h) = 1 - \theta(f_t - h_t), \tag{3.14} \]

where $\theta(s) = 1_{\{s \ge 0\}}$ is the Heaviside function. Set $\tau_T(h) = T$ and define recursively

\[ \tau_t(h)(x) = t\, \theta_{f,t}(h)(x_t) + \tau_{t+1}(h)(x)\, \theta^-_{f,t}(h)(x_t), \quad x \in \mathcal{X}. \tag{3.15} \]

For every $h \in L_1(\mathcal{X})$ we get a valid stopping rule $\tau_t(h)$ which does not anticipate the future,
because at each point in time $t$, the knowledge of $X_t$ is sufficient to decide whether to stop or to
continue.

Definition 3.1. The family of stopping rules $\{\tau_t(h) \mid h \in L_1(\mathcal{X}),\ h_T = f_T\}$ is called the set of
Markov stopping rules.

The stopping rule $\tau_t(h)$ depends only on $h_t, \ldots, h_{T-1}$ and is therefore constant as a function
of the arguments $x_0, \ldots, x_{t-1}$. Moreover, the recursion formula (3.13) implies that the optimal
stopping rule $\tau_t^*$ at time $t$ is identical to the Markov stopping rule $\tau_t(q)$.
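Unrolling the recursive definition, a Markov stopping rule stops at the first time $s \ge t$ at which the payoff is at least the value of $h$. The following Python sketch evaluates such a rule along one path. The function name and the representation of $f$ and $h$ as lists of callables indexed by time are assumptions made for illustration.

```python
def markov_stopping_time(t, path, f, h):
    """tau_t(h)(x): first s >= t with f_s(x_s) >= h_s(x_s).

    f and h are lists of callables indexed by time. The terminal condition
    h_T = f_T guarantees that the rule stops at the horizon T at the latest.
    """
    T = len(path) - 1
    for s in range(t, T + 1):
        if f[s](path[s]) >= h[s](path[s]):  # Heaviside indicator equals 1: stop
            return s
    raise ValueError("h_T must equal f_T so that the rule stops at the horizon")
```

Only the path values from time $t$ onward are read, which mirrors the statement that the rule is constant in the earlier coordinates.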

Applying the Markov stopping rule $\tau_t(h)$ leads to the cash flow $f_{\tau_t(h)}(X_{\tau_t(h)})$. More generally,
we define for $x \in \mathcal{X}$, any $0 \le w \le T - t$, and $h \in L_1(\mathcal{X})$ with $h_T = f_T$ the function

\[ \vartheta_{t:w}(f, h)(x) = \sum_{s=t}^{t+w} f_s(x_s)\, \theta_{f,s}(h)(x_s) \prod_{r=t}^{s-1} \theta^-_{f,r}(h)(x_r) + h_{t+w}(x_{t+w}) \prod_{r=t}^{t+w} \theta^-_{f,r}(h)(x_r), \tag{3.16} \]

where we follow the convention that the product over an empty index set is equal to one. The
function $\vartheta_{t:w}(f, h)$ has a natural financial interpretation. It is the cash flow we would obtain
by holding the American option for at most $w$ periods, applying the stopping rule $\tau_t(h)$, and
selling the option at time $t + w$ for the price of $h_{t+w}(X_{t+w})$, if it is not exercised before. We call
$\vartheta_{t:w}(f, h)$ the cash flow function induced by $h$.

Equations (3.9) and (3.12) provide two different representations of $q_t$. In terms of $\vartheta_{t:w}(f, h)$
they can be reexpressed as follows. Because $f_{\tau_{t+1}^*}(X_{\tau_{t+1}^*}) = f_{\tau_{t+1}(q)}(X_{\tau_{t+1}(q)}) = \vartheta_{t+1:T-t-1}(f, q)$,
(3.9) becomes

\[ q_t(X_t) = E[\vartheta_{t+1:T-t-1}(f, q) \mid X_t], \tag{3.17} \]

whereas $\vartheta_{t+1:0}(f, q) = \max(f_{t+1}, q_{t+1})$ turns (3.12) into

\[ q_t(X_t) = E[\vartheta_{t+1:0}(f, q) \mid X_t]. \tag{3.18} \]

In fact, there is a whole family of representations, parameterized by $w \in \{0, \ldots, T - t - 1\}$.
Recursively expanding $q_{t+1}, \ldots, q_{t+w}$ in (3.12) and using the Markov property we find that

\[ q_t(X_t) = E[\vartheta_{t+1:w}(f, q) \mid X_t], \tag{3.19} \]

for any $0 \le w \le T - t - 1$.
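The cash flow with look-ahead $w$ can be evaluated path by path: exercise at the first time in the window $t, \ldots, t+w$ at which the payoff is at least the value of $h$, and otherwise collect $h_{t+w}$ at the end of the window. The following Python sketch follows this verbal description. The function name and the representation of $f$ and $h$ as lists of callables are assumptions made for illustration.

```python
def cash_flow(t, w, path, f, h):
    """Cash flow induced by h over the window t, ..., t + w along one path.

    Pays f_s(x_s) at the first s in the window with f_s(x_s) >= h_s(x_s);
    if there is no such s, pays h_{t+w}(x_{t+w}).
    """
    for s in range(t, t + w + 1):
        if f[s](path[s]) >= h[s](path[s]):  # exercise at time s
            return f[s](path[s])
    return h[t + w](path[t + w])            # not exercised within the window
```

With `w = 0` the function returns the larger of the payoff and `h` at time `t`, and with the window reaching the horizon it returns the payoff at the Markov stopping time, which are the two extreme representations of $q_t$.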

4. Optimal Stopping as a Recursive Statistical Learning Problem 

The calculation of the recursive series of nested regression problems (3.19) becomes
increasingly demanding for high dimensional state spaces. A further complication is introduced if
the transition densities of the Markov process $X$ are not explicitly available. In this case, the
only means to assess the distribution of the Markov process is by simulating a large number of
independent sample paths $X_1, X_2, \ldots, X_n$. Problems of this kind are considered in statistical
learning theory.

4.1. Dynamic Look-Ahead Algorithm. Assume a payoff $f \in L_2(\mathcal{X})$. We interpret the unknown
continuation value $q_t \in L_2(\mathbb{R}^m, \mu_t)$ as an approximation of the unknown optimal cash
flow $\vartheta_{t+1:w}(f, q)$, in the sense that it only depends on the state of the underlying Markov process
at time $t$. To reduce the problem further we choose for every $t \ge 0$ a suitable set of functions $\mathcal{H}_t$
defined on $\mathbb{R}^m$. Let

\[ \mathcal{H} = \{ h = (h_0, \ldots, h_T) : \mathcal{X} \to \mathbb{R}^{T+1} \mid h_t \in \mathcal{H}_t \}. \tag{4.1} \]

Given a finite number of independent sample paths

\[ D_n = \{X_1, \ldots, X_n\}, \tag{4.2} \]

we want to find a learning rule $\hat{q}_{\mathcal{H}}$, i.e., a map

\[ \hat{q}_{\mathcal{H}} : D_n \mapsto \hat{q}_{\mathcal{H}}(D_n) = (\hat{q}_{\mathcal{H},0}(D_n), \ldots, \hat{q}_{\mathcal{H},T}(D_n)) \in \mathcal{H}, \tag{4.3} \]

such that $\hat{q}_{\mathcal{H},t}(D_n)$ provides an accurate approximation of $\vartheta_{t+1:w}(f, q)$ in $\mathcal{H}_t$. The dynamic
programming principle imposes consistency conditions on a learning rule.

Definition 4.1. A learning rule $\hat{q}_{\mathcal{H}}$ is called admissible if $\hat{q}_{\mathcal{H},T}(D_n) = f_T$ and $\hat{q}_{\mathcal{H},t}(D_n)$, as
a function of $D_n$, does not depend on the sample paths up to and including time $t - 1$, or
equivalently, is a function of $\{X_{i,s} \mid s \ge t,\ i = 1, \ldots, n\}$ alone.

We apply empirical risk minimization to recursively define an admissible learning rule as
follows. At the horizon $T$ we set

\[ \hat{q}_{\mathcal{H},T}(D_n) = f_T. \tag{4.4} \]

For $t < T$, equation (3.19) suggests that we approximate the cash flow function

\[ \vartheta_{t+1:w}(f, \hat{q}_{\mathcal{H}}(D_n)), \tag{4.5} \]

for some suitably selected parameter $w = w(t) \in \{0, \ldots, T - t - 1\}$. We choose

\[ \hat{q}_{\mathcal{H},t}(D_n) = \operatorname{argmin}_{h \in \mathcal{H}_t} P_n |h - \vartheta_{t+1:w}(f, \hat{q}_{\mathcal{H}}(D_n))|^2 = \operatorname{argmin}_{h \in \mathcal{H}_t} \frac{1}{n} \sum_{i=1}^{n} |h(X_{i,t}) - \vartheta_{t+1:w}(f, \hat{q}_{\mathcal{H}}(D_n))(X_i)|^2, \tag{4.6} \]

which is an element of $\mathcal{H}_t$ with minimal empirical $L_2$-distance from the cash flow function (4.5).
Because the objective function in the optimization problem (4.6) depends solely on the functions
$\hat{q}_{\mathcal{H},s}(D_n)$, $s = t + 1, \ldots, t + w + 1$, we see by induction that the empirical risk minimization
algorithm (4.6) indeed leads to an admissible learning rule.
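The backward recursion can be sketched in a few lines of Python for a scalar state. This is an illustration under stated assumptions, not the setting analyzed in this paper: the approximation space at each time is taken to be the affine functions $a + bx$ (convex but not uniformly bounded), the payoff is the same function at every time, and the names `fit_line`, `cash_flow`, `dynamic_look_ahead` and `w_of_t` are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y ~ a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0.0:                      # degenerate design: fit a constant
        return my, 0.0
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def cash_flow(t, w, path, payoff, q_hat, T):
    """Cash flow over the window t, ..., t + w induced by the fitted q_hat."""
    for s in range(t, t + w + 1):
        fs = payoff(path[s])
        # terminal condition q_hat_T = f_T; otherwise evaluate the fitted line
        hs = fs if s == T else q_hat[s][0] + q_hat[s][1] * path[s]
        if fs >= hs:                    # exercise at time s
            return fs
    return hs                           # not exercised: collect q_hat at t + w


def dynamic_look_ahead(paths, payoff, T, w_of_t):
    """Backward empirical risk minimization with look-ahead w_of_t(t)."""
    q_hat = [None] * T                  # q_hat[t] = (a_t, b_t) for t < T
    for t in range(T - 1, -1, -1):
        w = w_of_t(t)
        if not 0 <= w <= T - t - 1:
            raise ValueError("look-ahead parameter out of range")
        xs = [p[t] for p in paths]
        ys = [cash_flow(t + 1, w, p, payoff, q_hat, T) for p in paths]
        q_hat[t] = fit_line(xs, ys)     # empirical minimizer over affine functions
    return q_hat
```

Each regression at time `t` reads only the fitted functions at later times, so the rule is admissible in the sense of Definition 4.1. Passing `lambda t: 0` or `lambda t: T - t - 1` as `w_of_t` selects the shortest or the longest look-ahead.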
Remark 4.2. It is important to note that, while the function $\hat{q}_{\mathcal{H}}(D_n)$ is a function of $x \in \mathcal{X}$,
its choice depends on the sample $D_n$. Therefore, $\hat{q}_{\mathcal{H}}(D_n)$ is a random element with values in $\mathcal{H}$
which is defined on the countable product space $(\mathcal{X}^\infty, P^\infty, \mathcal{F}^\infty)$. Strictly speaking, for a sample
size $n$ only the first $n$ coordinates of $\mathcal{X}^\infty$ are relevant. Analogously, the expectation

\[ E[\hat{q}_{\mathcal{H}}(D_n)] = \int_{\mathcal{X}} \hat{q}_{\mathcal{H}}(D_n)(x)\, dP(x) \tag{4.7} \]

of $\hat{q}_{\mathcal{H}}(D_n)$ over the path space $\mathcal{X}$ is still a random variable on $\mathcal{X}^\infty$.

Definition 4.3. The dynamic look-ahead algorithm with look-ahead parameter $w = w(t)$,
$0 \le w(t) \le T - t - 1$, approximates the continuation value $q_t$ by the empirical minimizer $\hat{q}_{\mathcal{H},t}(D_n)$ of
(4.6).

The cash flow (4.5) depends on the next $w + 1$ time periods, hence, it "looks ahead" $w + 1$
periods. The algorithm is called "dynamic" because the look-ahead parameter $w$ may be chosen
time and sample dependent. We simplify our notation and drop the explicit dependency on the
sample $D_n$, the sample size $n$, and the look-ahead parameter $w$, writing $\hat{q}_{\mathcal{H},t}$ for the solution of
the empirical minimization problem (4.6).

4.2. Tsitsiklis-Van Roy and Longstaff-Schwartz Algorithm. Both the Tsitsiklis-Van Roy
and the Longstaff-Schwartz algorithm are special instances of the dynamic look-ahead algorithm.
The Longstaff-Schwartz algorithm is based on the cash flow function

\[ \vartheta^{LS}_{t+1} = f_{\tau_{t+1}(\hat{q}_{\mathcal{H}})}(X_{\tau_{t+1}(\hat{q}_{\mathcal{H}})}), \tag{4.8} \]

which corresponds to the maximal possible value $w = T - t - 1$. At the other extreme, the choice
$w = 0$ in (4.5) results in the much simpler expression

\[ \vartheta^{TR}_{t+1} = \max(f_{t+1}, \hat{q}_{\mathcal{H},t+1}), \tag{4.9} \]

used in the Tsitsiklis-Van Roy algorithm. In its initial form, this algorithm was developed
to solve infinite horizon optimal stopping problems for ergodic Markov processes. The advantage
of $\vartheta^{TR}_{t+1}$ is its numerical simplicity. On the other hand, $\vartheta^{LS}_{t+1}$ is better suited to approximate the
optimal stopping rule because it incorporates all future time points up to the final horizon. This
property is particularly important for Markov processes with slow mixing properties.

The dynamic look-ahead algorithm introduced in Definition 4.3 interpolates between the
Tsitsiklis-Van Roy and the Longstaff-Schwartz algorithm. A dynamic adjustment of the look-ahead
parameter $w = w(t)$ allows us to combine the algorithmic simplicity of Tsitsiklis-Van Roy
and the good approximation properties of the Longstaff-Schwartz approach. For instance we may
increase $w(t)$ for the last few time steps to compensate for the slow mixing of the Markov process.

5. Main Results 

In our definition of the dynamic look-ahead algorithm (4.6) we did not further specify the
approximation scheme. The richer the set of functions $\mathcal{H}_t$, the better it can approximate the
optimal cash flow. On the other hand, large sets $\mathcal{H}_t$ would require an abundance of samples
to get a minimizer in (4.6) with reasonably small variance. These conflicting objectives are
generally referred to as the bias-variance trade-off. To get a reasonable convergence behavior of
the dynamic look-ahead algorithm, we need to impose some restrictions on the massiveness of
the approximation spaces $\mathcal{H}_t$ and relate it to the number of samples which are used to calculate
the minimizers in (4.6).

The massiveness of a set of functions can be measured in terms of covering and entropy
numbers. The calculation of covering numbers of classes of functions has a long history dating
back to Kolmogorov and Tikhomirov (1959) and Birman and Solomyak (1967). We refer to
Carl and Stephani (1990) for a modern approach and additional references. An important type
of function class for which covering numbers can be estimated with combinatorial techniques
is the so-called Vapnik-Chervonenkis class or VC-class, which is by definition a class of
functions of finite VC-dimension. Informally speaking, the VC-dimension measures the size of
nonlinear sets of functions by looking at the maximum number of sign alternations of its elements.
To give a precise definition we consider a class of functions $\mathcal{G}$ defined on some set $S$. A set of $n$
points $\{x_1, \ldots, x_n\} \subset S$ is said to be shattered by $\mathcal{G}$ if there exists $r \in \mathbb{R}^n$ such that for every
$b \in \{0, 1\}^n$, there is a function $g \in \mathcal{G}$ such that for each $i$, $g(x_i) \ge r_i$ if $b_i = 1$, and $g(x_i) < r_i$ if
$b_i = 0$. The VC-dimension $\mathrm{vc}(\mathcal{G})$ of $\mathcal{G}$ is defined as the cardinality of the largest set of points which
can be shattered by $\mathcal{G}$. The function classes that will appear in the analysis of the fluctuations of
the empirical minimizers (4.6) fit very well into the theory of Vapnik-Chervonenkis. We introduce
the necessary tools of the VC-theory along the way as we prove the main results in Section 6.
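For a finite family of functions and a given witness vector $r$, the shattering condition can be checked by brute force. The following Python sketch is illustrative: the function name `is_shattered` is hypothetical, and the check only certifies shattering for the supplied $r$ and the supplied finite family, so a negative answer says nothing about other choices of $r$.

```python
def is_shattered(points, functions, r):
    """Check whether `functions` realize all 2^n sign patterns on `points`.

    A function g realizes the pattern b with b_i = 1 if g(x_i) >= r_i
    and b_i = 0 if g(x_i) < r_i.
    """
    realized = {
        tuple(1 if g(x) >= ri else 0 for x, ri in zip(points, r))
        for g in functions
    }
    return len(realized) == 2 ** len(points)
```

For affine functions on the real line, two points can be shattered, while the pattern (1, 0, 1) on three increasing points with a common threshold is never realized, because an affine function cannot be at least the threshold at both ends and strictly below it in the middle.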

Our error decomposition crucially depends on the convexity and the uniform boundedness of
the class of functions $\mathcal{H}_t$. We will impose for all $t \ge 0$ the following three conditions.

(H1) The class $\mathcal{H}_t$ is a closed convex subset of $L_p(\mathbb{R}^m, \mu_t)$ for some $2 \le p \le \infty$.

(H2) There exists a constant $d$ such that the VC-dimension of $\mathcal{H}_t$ satisfies $\mathrm{vc}(\mathcal{H}_t) \le d < \infty$.

(H3) The class $\mathcal{H}_t$ is uniformly bounded, i.e., for some constant $H$, $|h_t| \le H < \infty$ $\forall h_t \in \mathcal{H}_t$.

The convexity and uniform boundedness assumptions (H1) respectively (H3) are somewhat
restrictive, but encompass many common approximation schemes such as bounded convex sets in
finite dimensional linear spaces, local polynomial approximations, or tensor product splines.

5.1. Consistency and Convergence. The payoff function of an optimal stopping problem is
often unbounded. For example, in option pricing even the simplest payoff functions of American
put and call options increase linearly in the underlying. On the other hand, any numerical
algorithm works at finite precision, and tight error or convergence rate estimates rely on some
sort of boundedness assumption. We therefore introduce the truncation operator $T_\beta$ which
assigns to a real valued function $g$ the bounded function

\[ T_\beta g = \begin{cases} g & \text{if } |g| \le \beta, \\ \operatorname{sign}(g)\, \beta & \text{otherwise,} \end{cases} \tag{5.1} \]

and to $g \in L_p(\mathcal{X})$ its coordinate-wise truncation $T_\beta g = (T_\beta g_0, \ldots, T_\beta g_T)$. We then replace the
estimator (4.6) by

\[ \hat{q}_{\mathcal{H}_n,t} = \hat{q}_{\mathcal{H}_n,t}(D_n) = \operatorname{argmin}_{h \in \mathcal{H}_{n,t}} P_n |h - \vartheta_{t+1:w(t)}(T_{\beta_n} f, \hat{q}_{\mathcal{H}_n}(D_n))|^2, \tag{5.2} \]

where $T_{\beta_n} f$ is the payoff truncated at a threshold $\beta_n$. The estimator (5.2) rests on the
hypothesis that whenever $\hat{q}_{\mathcal{H}_n,s}(D_n)$ is an approximation of $q_s$ for $s \ge t + 1$, then the cash flow
$\vartheta_{t+1:w}(T_{\beta_n} f, \hat{q}_{\mathcal{H}_n}(D_n))$ is a sufficiently accurate substitute for the unknown optimal cash flow
$\vartheta_{t+1:w}(T_{\beta_n} f, q)$. We justify this hypothesis in Proposition 6.4 by proving a conditional Lipschitz
continuity of the functional $h \mapsto \vartheta_{t+1:w}(T_{\beta_n} f, h)$ at $q$. The error propagation of the recursive
estimation procedure is resolved in Corollary 6.2, which relies on the convexity of the approximation
architecture.

The first main result provides a sufficient condition on the growth of the number of sample
paths $n$, the VC-dimension $\mathrm{vc}(\mathcal{H}_{n,t})$ of the approximation spaces $\mathcal{H}_{n,t}$, and the truncation level
$\beta_n$ to ensure convergence. Let $(\mathcal{X}^\infty, P^\infty, \mathcal{F}^\infty)$ be the countable product space introduced in
Remark 4.2. We use the notation $\mathbf{P} = P^\infty$ and denote by $\mathbf{E}$ the expectation with respect to $\mathbf{P}$.
Theorem 5.1. Assume the payoff $f$ is in $L_2(\mathcal{X})$ and $\mathcal{H}_n$ is a sequence of approximation spaces
uniformly bounded by $\beta_n$ such that $\bigcup_{n=1}^{\infty} \mathcal{H}_n$ is dense in $L_2(\mathcal{X})$. Furthermore, assume that each
$\mathcal{H}_{n,t}$ is closed, convex, and $\mathrm{vc}(\mathcal{H}_{n,t}) \le d_n$. Let $\hat{q}_{\mathcal{H}_n,t}$ be the empirical $L_2$-minimizer from (5.2)
for a look-ahead parameter $0 \le w(t) \le T - t - 1$. Under the assumptions

\[ \beta_n \to \infty, \quad d_n \to \infty, \quad \frac{d_n \beta_n^2 \log(\beta_n)}{n} \to 0 \quad (n \to \infty), \tag{5.3} \]

it follows that

\[ \|\hat{q}_{\mathcal{H}_n,t} - q_t\|_2 \to 0, \tag{5.4} \]
in probability and in $L_1(\mathbf{P})$. If furthermore

\[ \frac{\beta_n^2 \log(n)}{n} \to 0, \tag{5.5} \]

then the convergence in (5.4) holds almost surely.

Proof. See Section 6.3. □ 

Theorem 5.1 proves convergence of the truncated version (5.2) of the dynamic look-ahead
algorithm. It generalizes previous results in two directions. First, the number of samples, the size
of the approximation architecture (measured in terms of the VC-dimension), and the truncation
threshold are increased simultaneously. Glasserman and Yu (2003) address the same question
for the Longstaff-Schwartz algorithm with linear finite dimensional approximation. They avoid
truncation by imposing fourth-order moment conditions and find that the number of samples must
grow surprisingly fast. For example, if $X_t$ is log-normally distributed and $n$ denotes the dimension
of the linear approximation space, the number of samples must be proportional to $\exp(n^2)$.
Second, Theorem 5.1 covers approximation architectures of bounded VC-dimension and does not
depend on the law of the underlying Markov process. In contrast, the convergence proof of
Clement et al. (2002) relies on the additional assumption $P(q = f) = 0$.

In (5.2) we reduce unbounded to bounded payoffs by truncating at a suitable cutoff level. The 
next result bounds the approximation error in terms of the cutoff level. 

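Proposition 5.2. Let $1 \le p < \infty$ and $f \in L_p(\mathcal{X})$ be a nonnegative payoff function. If $q_\beta$ is the
continuation value of the truncated payoff $T_\beta f$, it follows that

\[ \|q_t - q_{\beta,t}\|_p \to 0, \tag{5.6} \]

for $\beta \to \infty$, and if $1 \le r < p$, then

\[ \|q_t - q_{\beta,t}\|_r \le \sum_{s=t+1}^{T} \Big( r \int_\beta^\infty u^{r-1} P(f_s > u)\, du \Big)^{1/r} \le O(\beta^{1 - p/r}). \tag{5.7} \]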

Proof. See Section 6.5. □ 
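The polynomial rate $O(\beta^{1-p/r})$ can be checked by a short calculation. As a sketch, assume that the error of truncating a single payoff $f_s$ at level $\beta$ is controlled by the tail integral $r \int_\beta^\infty u^{r-1} P(f_s > u)\, du$ (the exact form of this term is an assumption of the sketch) and apply Markov's inequality in the form $P(f_s > u) \le \|f_s\|_p^p\, u^{-p}$:

```latex
\begin{align*}
r \int_\beta^\infty u^{r-1} P(f_s > u)\, du
  &\le r\, \|f_s\|_p^p \int_\beta^\infty u^{r-1-p}\, du
   = \frac{r}{p-r}\, \|f_s\|_p^p\, \beta^{r-p}, \\
\Big( r \int_\beta^\infty u^{r-1} P(f_s > u)\, du \Big)^{1/r}
  &\le \Big( \frac{r}{p-r} \Big)^{1/r} \|f_s\|_p^{p/r}\, \beta^{1-p/r}.
\end{align*}
```

Since $r < p$, the exponent $1 - p/r$ is negative, so the bound vanishes as $\beta \to \infty$, and it decays faster the more moments the payoff has beyond order $r$.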

The bound (5.7) can be refined in terms of Orlicz norms. The Orlicz norm of a random variable
$Y$ is defined as

\[ \|Y\|_\psi = \inf\{ C > 0 \mid E[\psi(|Y| C^{-1})] \le 1 \}, \tag{5.8} \]

where $\psi$ is a nondecreasing, convex function with $\psi(0) = 0$. Note that $\psi(y) = y^p$ reduces to the
usual $L_p$-norms. If $\|f_{t+1}\|_\psi < \infty$, Markov's inequality implies the tail bound

\[ P(f_{t+1} > u) \le \frac{1}{\psi(u\, \|f_{t+1}\|_\psi^{-1})}, \tag{5.9} \]

which we then can apply to the middle term in (5.7). In particular, $\psi_p(x) = \exp(x^p) - 1$ leads
to the exponential bound

\[ P(f_{t+1} > u) \le \exp(-u^p \|f_{t+1}\|_{\psi_p}^{-p}) \big( 1 - \exp(-\beta^p \|f_{t+1}\|_{\psi_p}^{-p}) \big)^{-1}, \tag{5.10} \]

for all $u \ge \beta$. In financial applications a typical situation is $f_{t+1} = f(\exp(X_{t+1}))$, where $X_{t+1}$ is
normally distributed and $f(y) \le C y^q$ has polynomial growth. The tail estimate

\[ P(f_{t+1} > u) \le O\Big( \frac{1}{\log(u)} \exp(-\log(u)^2) \Big) \tag{5.11} \]

is a direct consequence of the well-known asymptotic expansion

\[ 1 - \Phi(u) = \phi(u)\, u^{-1} \Big( 1 - \frac{1}{u^2} + \frac{3}{u^4} + O(u^{-6}) \Big) \tag{5.12} \]

for the tail of the standard normal distribution $\Phi$ with density $\phi$. The estimate (5.11) improves the rate of
order $O(\beta^{1-p/r})$ in (5.7) considerably, despite the logarithmic terms in the exponent.

5.2. Error Estimate and Sample Complexity. Theorem 5.1 shows that if we simultaneously
increase the truncation threshold, the VC-dimension of the approximation architecture, and the
number of samples at a proper rate, the resulting estimator (5.2) converges to the solution of the
optimal stopping problem. Proposition 5.2 quantifies the error of an initial truncation at a fixed
threshold. We continue the error analysis of the dynamic look-ahead algorithm by truncating
unbounded payoffs at a sufficiently large threshold and considering a single approximation
architecture $\mathcal{H}$. The second main result bounds the overall error for bounded payoff functions
in terms of the approximation error and the sample error, generalizing the familiar bias-variance
trade-off in nonparametric regression and density estimation.

Theorem 5.3. Consider a payoff $f \in L_\infty(\mathcal{X})$ with $\|f_t\|_\infty \le \Theta$. Assume that each $\mathcal{H}_t$ is a closed
convex set of functions, uniformly bounded by $H$, with $\mathrm{vc}(\mathcal{H}_t) \le d$. Let $\hat{q}_{\mathcal{H},t}(D_n)$ be the empirical
$L_2$-minimizer from (4.6) for a look-ahead parameter $0 \le w(t) \le T - t - 1$. Set $\beta = \max(\Theta, H)$.
Then, for n > 382/3 2 /e, 

E [Mn,t(D n ) - q t \\j] < 2-16 t °W max inf + (5.13) 

L J s=t,...,t+w(t) + l h£H s 



2 • m^\w(t) + 2) f 6998/3 2 + log(6998if/3 2 ) + v\og(n) 
\ n n 

v = 2d(c(w(t)) + 1), K = 6e 4 (d + l) 2 (c(w{t))d + l) 2 (1024e/?) t \ 



and 

\[ c(w(t)) = 2(w(t) + 2) \log_2(e(w(t) + 2)). \]

Proof. See Section 6.3. □ 

The effectiveness of a learning algorithm can be quantified by the number of samples which
are required to produce, with high confidence $1 - \delta$, an almost minimizer

\[ \|\hat{q}_{\mathcal{H},t}(D_n) - q_t\|_2^2 \le \inf_{h_t \in \mathcal{H}_t} \|h_t - q_t\|_2^2 + \varepsilon, \quad \forall t = 0, \ldots, T - 1, \tag{5.14} \]

for a certain error accuracy $\varepsilon$. In (5.14) the error is measured relative to the minimal
approximation error at time step $t$. It is evident from (5.13) that an accurate estimate is only obtained
if the approximation error in all previous learning tasks is small as well. To disentangle sample
complexity and approximation error, we measure the performance of the learning rule relative to
the overall approximation error in (5.13).
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Corollary 5.4. Assume f G Loo(X) with ||/t||oo < @ o/nd let H be as in Theorem 5.3. The 
sample complexity 

c(e, 8) = min< no | Vn > no, 

r \ 1 <5 " 15> 

J" ll9x.,(£>„)-? ( |g>2-16-"> max tef - 9 ,||1 + e <j \ 

\ s=t,...,t+w{t) + l hf£H s J 



of the empirical Li-minimizer (4.6) is bounded by 



1. f K\ , /l 



c(e,5) < 2-13996(w(i) + 2)16"' (t) /3 2 niax^log^yJ ,ulog( ^jj , (5.16) 
where $\beta$, $v$ and $K$ are as in Theorem 5.3.

Proof. See Section 6.3. □ 

Theorem 5.3 and Corollary 5.4 estimate the sample error for a fixed approximation scheme and
truncation threshold. The bound (5.13) and the complexity estimate (5.16) hold uniformly for
any law of $X$ and payoff function $f$ with $\|f\|_\infty \le \Theta$. Hence, the bounds are independent of the
distribution of the underlying Markov process, the optimal stopping time, and the smoothness
of the continuation value. The asymptotic rate $O(\log(n)\,n^{-1})$ of the sample error (the second
term on the right hand side of (5.13)) is typical for nonparametric least squares estimates with
approximation schemes of finite VC-dimension, see, e.g., Györfi et al. (2002, Theorem 11.5).

If we impose additional assumptions on the smoothness of the continuation value $q$ the
approximation errors $\inf_{h\in\mathcal{H}_s} \|h - q_s\|_2^2$ in (5.13) can be estimated further by approximation theory.
Smoothness assumptions are not unreasonable. Although for many financial applications the
payoff is only continuous or piecewise continuous, the continuation value is often smooth. The
degree of smoothness of $q$ is crucial for how to choose approximation spaces $\mathcal{H}_n$ to get the most
favorable rate of convergence by properly balancing the approximation error and the sample
error.

Smoothness is often measured in terms of Sobolev spaces $W^k(L_p(\Omega,\lambda))$, where $\Omega \subset \mathbb{R}^m$ is a
domain in $\mathbb{R}^m$ and $\lambda$ is the Lebesgue measure on $\Omega$. These are functions $g \in L_p(\Omega,\lambda)$ which
have all their distributional derivatives of order up to $k$ in $L_p(\Omega,\lambda)$. The Sobolev (semi-)norm
$\|g\|_{p,k,\Omega,\lambda}$ may be regarded as a measure of smoothness for a function $g \in W^k(L_p(\Omega,\lambda))$.

In practical applications of the Longstaff-Schwartz algorithm approximation by polynomials
performs rather well. Let $\mathcal{P}_r$ be the space of multivariate polynomials on $\mathbb{R}^m$ with coordinate-wise
degree at most $r - 1$. For simplicity we assume $X_t$ is localized to a sufficiently large cube
$I \subset \mathbb{R}^m$. This assumption can be satisfied by applying a truncation argument similar to the one
developed in Proposition 5.2.

Corollary 5.5. Assume that $X_t$ is localized to a cube $I \subset \mathbb{R}^m$, $f \in L_\infty(X)$, and that the
continuation value $q_t$ is in the Sobolev space $W^k(L_\infty(I,\lambda))$ for all $t$. Define the sequence of
approximation architectures

$$\mathcal{H}_{n,t} = \{ p \in \mathcal{P}_{n^{1/(m+2k)}} \mid \|p\|_{\infty,I,\lambda} \le 2\|q_t\|_{\infty,k,I,\lambda} \}. \tag{5.17}$$

Then,

$$E\big[\|\hat q_{\mathcal{H}_n,t}(D_n) - q_t\|_2^2\big] \le O\big(\log(n)\,n^{-2k/(m+2k)}\big). \tag{5.18}$$

If $\mu_t$ has a bounded density with respect to the Lebesgue measure and $q_t \in W^k(L_p(I,\lambda))$ for some
$p \ge 2$ the same result holds if we replace $\mathcal{H}_{n,t}$ in (5.17) by

$$\mathcal{H}_{n,t} = \{ p \in \mathcal{P}_{n^{1/(m+2k)}} \mid \|p\|_{p,I,\lambda} \le 2\|q_t\|_{p,k,I,\lambda} \}. \tag{5.19}$$

Proof. The result essentially follows from Jackson type estimates, Theorem 6.2 in Chapter 7 of
DeVore and Lorentz (1993). See Section 6.4. □

Corollary 5.5 is a prototypical application of Theorem 5.3 to global approximation by
polynomials. Other approximation schemes can be treated similarly, as long as the conditions (H1)-(H3)
are satisfied. To get the rate stated in Corollary 5.5 the dimension $n^{m/(m+2k)}$ of the polynomial
approximation architecture (5.17) has to grow with increasing sample size such that the
approximation error and the sample error are balanced. The rate (5.18) is up to a logarithmic term the
lower minimax rate of convergence for estimating regression functions, see Stone (1982).
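
The balancing argument can be checked numerically. The sketch below works at the level of orders of magnitude only: it assumes a squared approximation error of order $r^{-2k}$ for polynomials of coordinate-wise degree below $r$ (a Jackson-type rate), a sample error of order (dimension) $\cdot \log(n)/n$, and drops all constants. The function names are illustrative.

```python
import math


def polynomial_degree(n: int, m: int, k: int) -> float:
    """Degree parameter r = n^(1/(m+2k)) used in (5.17)."""
    return n ** (1.0 / (m + 2 * k))


def architecture_dimension(n: int, m: int, k: int) -> float:
    """Dimension r^m = n^(m/(m+2k)) of the polynomial space on R^m."""
    return polynomial_degree(n, m, k) ** m


def approximation_error(n: int, m: int, k: int) -> float:
    """Assumed order r^(-2k) of the squared approximation error."""
    return polynomial_degree(n, m, k) ** (-2 * k)


def sample_error(n: int, m: int, k: int) -> float:
    """Assumed order dimension * log(n) / n of the sample error."""
    return architecture_dimension(n, m, k) * math.log(n) / n


def rate(n: int, m: int, k: int) -> float:
    """Overall order log(n) * n^(-2k/(m+2k))."""
    return math.log(n) * n ** (-2.0 * k / (m + 2 * k))


if __name__ == "__main__":
    m, k = 2, 3  # hypothetical state dimension and smoothness
    for n in (10**3, 10**5, 10**7):
        print(n, approximation_error(n, m, k), sample_error(n, m, k), rate(n, m, k))
```

With this choice of $r$ the two error terms differ only by the factor $\log(n)$, so neither dominates the other by a power of $n$.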

5.3. Discussion and Remarks. The Longstaff-Schwartz algorithm and its generalization, the
dynamic look-ahead algorithm, perform surprisingly well for many practical applications such as
pricing American options which are not too far in or out of the money. This empirical observation
can be explained as follows. It follows from (3.19) that an approximation of the optimal cash flow
$\vartheta_{t+1:w}(f,q)$ can be used to estimate the continuation value at time $t$. A closer look at definition
(3.16) shows that for the maximal possible value $w = T - t - 1$ the cash flow $\vartheta_{t+1:w}(f,h)$ is
close (in the $L_2$-sense) to the optimal $\vartheta_{t+1:w}(f,q)$ if the signs of $f - h$ and $f - q$ disagree only
on a subset of the path space with small probability, or equivalently if the probability of the
symmetric difference

$$P(\{f - h \ge 0\}\,\Delta\,\{f - q \ge 0\}) \tag{5.20}$$

is small. Note that a small probability (5.20) does not necessarily entail that the functions $h$ and $q$
are close in the $L_2$-sense. If the look-ahead parameter $w$ satisfies $w < T - t - 1$ then $\vartheta_{t+1:w}(f,h)$ is
a good approximation of the optimal cash flow if, in addition to a small probability (5.20), also the
$L_2$-distance between $h_{t+w+1}$ and the unknown continuation value $q_{t+w+1}$ is small. Consequently,
a look-ahead parameter $0 \le w < T - t - 1$ requires good approximations for $q_{w+1},\dots,q_{T-1}$.
Determining accurate and stable estimators for $q_t$ with $t$ close to 1 may be difficult to achieve,
in particular if the samples of the Markov process do not cover sufficiently large parts of the
state space. This explains why the Tsitsiklis-Van Roy algorithm (corresponding to $w = 0$) may
perform badly for finite horizon problems.
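
The dependence of the cash flow on the exercise decisions alone can be illustrated along a single path. The sketch below follows the description of the cash flow as recalled in (6.62): exercise at the first time the payoff is at least the approximate continuation value, and otherwise collect the approximate continuation value at the last look-ahead time. The function name and the sample numbers are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def cash_flow(f: Sequence[float], h: Sequence[float]) -> float:
    """Cash flow along one path over the times t+1, ..., t+w+1.

    f[i] is the payoff and h[i] the approximate continuation value at the
    i-th of these times, so len(f) == len(h) == w + 1.
    """
    if len(f) != len(h) or len(f) == 0:
        raise ValueError("f and h must be non-empty and of equal length")
    for payoff, continuation in zip(f, h):
        if payoff >= continuation:  # exercise indicator equals 1
            return payoff
    # No exercise within the look-ahead window: fall back to h at time t+w+1.
    return h[-1]


if __name__ == "__main__":
    # Maximal look-ahead: the last entry is the terminal time, where h_T = f_T.
    f = [1.0, 3.0, 2.0]
    q = [2.0, 2.5, 2.0]   # hypothetical continuation values
    h = [5.0, 0.0, 2.0]   # poor approximation with the same signs of f - h
    print(cash_flow(f, q), cash_flow(f, h))
```

In the demo, `h` is far from `q` in value but induces the same exercise decisions, so both cash flows coincide. For `w = 0` the function reduces to `max(f[0], h[0])`, which is the case where the value of the approximation enters directly.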

As opposed to the empirically demonstrated efficiency of the Longstaff-Schwartz algorithm, the
results of Theorem 5.3 and Corollary 5.4 are somewhat pessimistic. For practical parameter values
$\varepsilon$, $\delta$, $d$, $w$, and large enough cutoff level $\beta$, the sample complexity bound (5.16) leads to a very large
sample size. The reason for the pessimistic sample size estimates is twofold. First, the estimator
$\hat q_{\mathcal{H}}$ is sensitive to error propagation effects caused by the backward induction. This leads to error
estimates such as (5.13) which depend exponentially on the number of look-ahead periods $w(t)$.
The minimal choice $w = 0$ would resolve the exponential dependence but, as explained above, may
have limited capabilities to approximate the optimal cash flow. Another reason is the generality
of our error estimates. We already observed that $\hat q_{\mathcal{H}}$ leads to an accurate approximation of the
optimal cash flow if the probability of the symmetric difference $P(\{f - \hat q_{\mathcal{H}} \ge 0\}\,\Delta\,\{f - q \ge 0\})$
is small. However, it is difficult to derive error estimates which take this effect into account
without imposing additional assumptions on the smoothness of the payoff and the distribution
of the stopping time in the neighborhood of $\{q = f\}$.

We considered in this work estimators based on straightforward empirical $L_2$-risk
minimization. A deficiency of the simple estimator considered in Corollary 5.5 is that the degree of
smoothness and an upper bound for $\|q_t\|_{\infty,k,I,\lambda}$ have to be known. There exists a variety of advanced
nonparametric regression estimators which have been developed to cope with the shortcomings
of the basic empirical risk minimization procedure. The main generalizations in this direction
are sieve estimators, studied for example by Shen and Wong (1994), Shen (1997), and Birgé and
Massart (1998), and adaptive methods such as complexity regularization, penalization, and model
selection, see Barron, Birgé and Massart (1999), Györfi et al. (2002), and the references therein.

The benefit of conditions (H1)-(H3) is that convexity arguments and VC-techniques lead to
error estimates without the necessity of imposing further assumptions on the Markov process 
X. On the downside, some important commonly used approximation schemes are excluded. For 
instance, condition (H2) conflicts with approximation in Sobolev or Besov balls, which have 
infinite VC-dimension, and the convexity condition (H3) is incompatible with many interesting 
nonlinear approximation schemes, such as n-term approximation, wavelet thresholding, or neural 
network architectures. 

A promising approach to extend and refine the results of this work is to approximate the cash
flow $\vartheta_{t:w}(f,h)$ by a suitably smoothed version with better Lipschitz continuity properties. We
then can express the massiveness of the approximation schemes directly in terms of covering
numbers and exploit the dependency of the covering numbers on the radius of the function
class. The additional step of first bounding the VC-dimension becomes unnecessary. However,
this approach is of less generality because it depends on the additional assumptions that the
probability $P(\{|q - f| \le \varepsilon\})$ decays to zero as $\varepsilon \to 0$ and the semi-group generated by the
Markov process $X$ has good smoothing properties.

Once we have selected a sequence of approximation architectures $\mathcal{H}_{n,t}$ the final step towards
an implementation is to determine a computationally efficient algorithm that minimizes the
empirical $L_2$-risk (5.2) over $\mathcal{H}_{n,t}$ in a polynomial number of time steps. Unfortunately, for many
approximation spaces, such as certain neural network architectures, constructing a solution which 
nearly minimizes the empirical L2-risk turns out to be NP-complete or even NP-hard. Thus there 
might still exist serious complexity theoretic barriers to efficient numerical implementations of 
specific approximation schemes. 

5.4. Acknowledgements. The author would like to thank Paul Glasserman, Tom Hurd, Markus 
Leippold, Maung Min-Oo, and Paolo Vanini for helpful discussions. The detailed comments and 
suggestions of a referee greatly helped to improve a first version of this paper. 

6. Proofs 

The proof of the main results, Theorem 5.1 and 5.3, is divided into three steps. The strategy is
as follows. First, we prove in Corollary 6.2 an error decomposition in terms of an approximation 
error and an expected centered loss (6.3). The second step is to estimate the covering numbers 
of the so called centered loss class (6.28), see Corollary 6.10. The last step is to apply empirical 
process techniques to bound the fluctuation of the expected centered loss in terms of the covering 
numbers. 

6.1. Error Decomposition. We assume from now on without further mentioning that $\mathcal{H} \subset L_2(X)$
and that all approximation spaces $\mathcal{H}_t$ are closed and convex. Before we can state our
main error decomposition we need to introduce some more notation. Let

$$\pi_{\mathcal{H}_t} : L_2(\mathbb{R}^m, \mu_t) \to \mathcal{H}_t \tag{6.1}$$

denote the projection onto the closed convex subset $\mathcal{H}_t \subset L_2(\mathbb{R}^m, \mu_t)$ and set

$$\mathrm{pr}_{\mathcal{H}_t} = \pi_{\mathcal{H}_t} \circ E[\,\cdot \mid X_t = \cdot\,] : L_2(X, P) \to \mathcal{H}_t. \tag{6.2}$$

For any $h = (h_0,\dots,h_T) : X \to \mathbb{R}^{T+1}$ with $h_T = f_T$ we introduce the centered loss

$$l_t(h) = |h_t - \vartheta_{t+1:w}(f,h)|^2 - |\mathrm{pr}_{\mathcal{H}_t}\vartheta_{t+1:w}(f,h) - \vartheta_{t+1:w}(f,h)|^2. \tag{6.3}$$

In favor of a more compact notation we have dropped the dependency of $l_t(h)$ on the look-ahead
parameter $w$. Note that the centered loss $l_t(h)$ only depends on $h_t,\dots,h_{T-1}$ and can take on
negative values. However, $E[l_t(h)] \ge 0$ as we will see in Lemma 6.3 below.

We decompose the overall error into an approximation error, a sample error, and a third 
term which captures the error propagation caused by the recursive definition of the dynamic 
look-ahead estimator. 

Proposition 6.1. Assume that $\hat q_{\mathcal{H}}$ is the result of an admissible learning rule. Then

$$\|\hat q_{\mathcal{H},t} - q_t\|_2 \le \inf_{h\in\mathcal{H}_t} \|h - q_t\|_2 + E[l_t(\hat q_{\mathcal{H}})]^{1/2} + 3\sum_{s=t+1}^{t+w+1} \|\hat q_{\mathcal{H},s} - q_s\|_2. \tag{6.4}$$

In general we cannot approximate $\vartheta_{t+1:w}(f,\hat q_{\mathcal{H}})$ by functions $h_t \in L_2(\mathbb{R}^m,\mu_t)$ arbitrarily well
and therefore

$$\inf_{h_t \in L_2(\mathbb{R}^m,\mu_t)} E\big[|h_t - \vartheta_{t+1:w}(f,\hat q_{\mathcal{H}})|^2\big] > 0. \tag{6.5}$$

For this reason we base our error decomposition (6.4) on the more complicated centered loss
function which expresses the sample error relative to the optimal one-step expected loss

$$E\big[|\mathrm{pr}_{\mathcal{H}_t}\vartheta_{t+1:w}(f,\hat q_{\mathcal{H}}) - \vartheta_{t+1:w}(f,\hat q_{\mathcal{H}})|^2\big]. \tag{6.6}$$

The first term on the right-hand side of (6.4) is the approximation error, a deterministic quantity,
which can be analyzed by approximation theory. The second term $E[l_t(\hat q_{\mathcal{H}})]^{1/2}$ is usually referred
to as the sample error. The last term in (6.4) collects the error propagation introduced by the
previous learning tasks through the dynamic programming backward recursion.



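Corollary 6.2. Let

$$\varepsilon_t = \inf_{h\in\mathcal{H}_t} \|h - q_t\|_2 + E[l_t(\hat q_{\mathcal{H}})]^{1/2} \tag{6.7}$$

denote the one-step error. Then,

$$\|\hat q_{\mathcal{H},t} - q_t\|_2 \le \varepsilon_t + 3\sum_{s=t+1}^{t+w+1} 4^{s-t-1}\varepsilon_s, \tag{6.8}$$

and

$$\|\hat q_{\mathcal{H},t} - q_t\|_2 \le 4^{w+1} \max_{s=t,\dots,t+w+1} \Big( \inf_{h\in\mathcal{H}_s} \|h - q_s\|_2 + E[l_s(\hat q_{\mathcal{H}})]^{1/2} \Big). \tag{6.9}$$

Proof of Corollary 6.2. This follows at once from (6.4) by recursively inserting the error estimate
(6.4) for $s \ge t+1$. □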
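
The passage from (6.8) to (6.9) is a geometric sum. As a sketch, write $\bar\varepsilon = \max_{s=t,\dots,t+w+1} \varepsilon_s$ (a shorthand introduced here only for this computation), bound every one-step error by $\bar\varepsilon$, and sum:

```latex
\begin{align*}
\varepsilon_t + 3\sum_{s=t+1}^{t+w+1} 4^{\,s-t-1}\,\varepsilon_s
  &\le \bar\varepsilon\,\Big(1 + 3\sum_{j=0}^{w} 4^{\,j}\Big) \\
  &= \bar\varepsilon\,\Big(1 + 3\cdot\frac{4^{\,w+1}-1}{3}\Big)
   = 4^{\,w+1}\,\bar\varepsilon .
\end{align*}
```

Squaring this bound is what produces a factor of order $16^{w+1}$ in estimates of the squared $L_2$-error.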

The proof of the error decomposition (6.4) crucially relies on the convexity of the
approximation spaces, Lemma 6.3, and a Lipschitz estimate for $\vartheta_{t+1:w}(f,h)$ as a function of $h$, Proposition
6.4.

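Lemma 6.3. Denote by

$$\rho_t(h)(x) = E[\vartheta_{t+1:w}(f,h) \mid X_t = x] \tag{6.10}$$

the regression function of $\vartheta_{t+1:w}(f,h)$. For any $h \in \mathcal{H}$ with $h_T = f_T$

$$\|h_t - \pi_{\mathcal{H}_t}\rho_t(h)\|_2^2 = \|h_t - \mathrm{pr}_{\mathcal{H}_t}\vartheta_{t+1:w}(f,h)\|_2^2 \le E[l_t(h)]. \tag{6.11}$$

In particular $E[l_t(h)] \ge 0$.

Proof. The proof is identical to the proof of Lemma 5 in Cucker and Smale (2001). Because
$\rho_t(h)$ is the regression function of $\vartheta_{t+1:w}(f,h)$, which only depends on $h_{t+1},\dots,h_{T-1}$, we have
for all $h_t \in L_2(\mathbb{R}^m,\mu_t)$

$$\|h_t - \rho_t(h)\|_2^2 = E\big[|h_t - \vartheta_{t+1:w}(f,h)|^2 - |\rho_t(h) - \vartheta_{t+1:w}(f,h)|^2\big]. \tag{6.12}$$

Let $h \in \mathcal{H}$ be arbitrary. Since $\mathcal{H}_t$ is convex and since $\mathrm{pr}_{\mathcal{H}_t}\vartheta_{t+1:w}(f,h) = \pi_{\mathcal{H}_t}\rho_t(h)$ minimizes
the distance to $\rho_t(h)$ it follows that

$$\langle \rho_t(h) - \pi_{\mathcal{H}_t}\rho_t(h),\ h_t - \pi_{\mathcal{H}_t}\rho_t(h) \rangle \le 0. \tag{6.13}$$

Therefore

$$\|\mathrm{pr}_{\mathcal{H}_t}\vartheta_{t+1:w}(f,h) - h_t\|_2^2 = \|\pi_{\mathcal{H}_t}\rho_t(h) - h_t\|_2^2 \le \|\rho_t(h) - h_t\|_2^2 - \|\rho_t(h) - \pi_{\mathcal{H}_t}\rho_t(h)\|_2^2. \tag{6.14}$$

Because both $h_t$ and $\pi_{\mathcal{H}_t}\rho_t(h)$ are in $L_2(\mathbb{R}^m,\mu_t)$ we can apply (6.12) twice which shows that the right
hand side of (6.14) is equal to

$$E\big[|h_t - \vartheta_{t+1:w}(f,h)|^2 - |\pi_{\mathcal{H}_t}\rho_t(h) - \vartheta_{t+1:w}(f,h)|^2\big] = E[l_t(h)]. \tag{6.15}$$

□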

For $w = 0$ we immediately obtain from $|\max(a,x) - \max(a,y)| \le |x - y|$ and Jensen's inequality
the uniform Lipschitz bound

$$\|E[\vartheta_{t+1:0}(f,g) - \vartheta_{t+1:0}(f,h) \mid X_t]\|_p \le \|g_{t+1} - h_{t+1}\|_p. \tag{6.16}$$

More generally, we have the following conditional Lipschitz continuity at the continuation value.

Proposition 6.4. For every $h \in L_p(X)$ with $h_T = f_T$ and $0 \le w < T - t$

$$\|E[\vartheta_{t+1:w}(f,h) \mid X_t] - q_t\|_p = \|E[\vartheta_{t+1:w}(f,h) - \max(f_{t+1}, q_{t+1}) \mid X_t]\|_p = \|E[\vartheta_{t+1:w}(f,h) - \vartheta_{t+1:w}(f,q) \mid X_t]\|_p. \tag{6.17}$$

Furthermore,

$$\|E[\vartheta_{t+1:w}(f,h) - \vartheta_{t+1:w}(f,q) \mid X_t]\|_p \le \sum_{s=t+1}^{t+w+1} \|h_s - q_s\|_p. \tag{6.18}$$

A similar estimate for the special case $w = T - t - 1$ can also be found in Clement et al.
(2002). Note that the uniform Lipschitz estimate (6.16) does not extend to $w > 0$. Proposition
6.4 only provides a Lipschitz estimate at the continuation value.

Proof. First note that from the Markov property

$$E[\vartheta_{t+1:w}(q) - \vartheta_{t+1:w}(h) \mid X_t] = E[\vartheta_{t+1:w}(q) - \vartheta_{t+1:w}(h) \mid \mathcal{F}_t]. \tag{6.19}$$

Equation (6.17) follows directly from the recursive definition of $q_t$. The case $w = 0$ is covered in
(6.16). For $w > 0$ it follows from the definition of $\vartheta_{t+1:w}$ that

$$\|E[\vartheta_{t+1:w}(q) - \vartheta_{t+1:w}(h) \mid \mathcal{F}_t]\|_p \le \|E[f_{t+1}(\theta_{f,t+1}(q) - \theta_{f,t+1}(h)) + \theta^-_{f,t+1}(q)\vartheta_{t+2:w-1}(q) - \theta^-_{f,t+1}(h)\vartheta_{t+2:w-1}(h) \mid \mathcal{F}_t]\|_p.$$

Adding and subtracting the term $q_{t+1}(\theta_{f,t+1}(q) - \theta_{f,t+1}(h))$, the triangle inequality implies

$$\|E[\vartheta_{t+1:w}(q) - \vartheta_{t+1:w}(h) \mid \mathcal{F}_t]\|_p \le \|E[(f_{t+1} - q_{t+1})(\theta_{f,t+1}(q) - \theta_{f,t+1}(h)) \mid \mathcal{F}_t]\|_p$$
$$\qquad + \|E[\theta^-_{f,t+1}(q)\vartheta_{t+2:w-1}(q) - \theta^-_{f,t+1}(h)\vartheta_{t+2:w-1}(h) + q_{t+1}(\theta_{f,t+1}(q) - \theta_{f,t+1}(h)) \mid \mathcal{F}_t]\|_p.$$

Now

$$\theta_{f,t+1}(q) - \theta_{f,t+1}(h) = 1_{\{f_{t+1} \ge q_{t+1}\}} - 1_{\{f_{t+1} \ge h_{t+1}\}} = 1_{\{0 \le f_{t+1} - q_{t+1} < h_{t+1} - q_{t+1}\}} - 1_{\{h_{t+1} - q_{t+1} \le f_{t+1} - q_{t+1} < 0\}},$$

which leads to

$$(f_{t+1} - q_{t+1})\big(1_{\{0 \le f_{t+1} - q_{t+1} < h_{t+1} - q_{t+1}\}} - 1_{\{h_{t+1} - q_{t+1} \le f_{t+1} - q_{t+1} < 0\}}\big)$$
$$\qquad \le (h_{t+1} - q_{t+1})1_{\{h_{t+1} - q_{t+1} > 0\}} - (h_{t+1} - q_{t+1})1_{\{h_{t+1} - q_{t+1} < 0\}} \le |h_{t+1} - q_{t+1}|.$$

By the Markov property $q_{t+1}(X_{t+1}) = E[\vartheta_{t+2:w-1}(q) \mid \mathcal{F}_{t+1}]$. Because $\theta_{f,t+1}(q)$ and $\theta_{f,t+1}(h)$
are $\sigma(X_{t+1})$-measurable it follows that

$$E[q_{t+1}(\theta_{f,t+1}(q) - \theta_{f,t+1}(h)) \mid \mathcal{F}_t] = E\big[E[\vartheta_{t+2:w-1}(q) \mid \mathcal{F}_{t+1}](\theta_{f,t+1}(q) - \theta_{f,t+1}(h)) \mid \mathcal{F}_t\big] = E[\vartheta_{t+2:w-1}(q)(\theta_{f,t+1}(q) - \theta_{f,t+1}(h)) \mid \mathcal{F}_t]. \tag{6.20}$$

By Jensen's inequality, this leads to

$$\|E[\vartheta_{t+1:w}(q) - \vartheta_{t+1:w}(h) \mid \mathcal{F}_t]\|_p$$
$$\qquad \le \|q_{t+1} - h_{t+1}\|_p + \|E[\vartheta_{t+2:w-1}(q)(1 - \theta_{f,t+1}(h)) - \vartheta_{t+2:w-1}(h)\theta^-_{f,t+1}(h) \mid \mathcal{F}_t]\|_p$$
$$\qquad = \|q_{t+1} - h_{t+1}\|_p + \|E[(\vartheta_{t+2:w-1}(q) - \vartheta_{t+2:w-1}(h))\theta^-_{f,t+1}(h) \mid \mathcal{F}_t]\|_p$$
$$\qquad \le \|q_{t+1} - h_{t+1}\|_p + \|E[\vartheta_{t+2:w-1}(q) - \vartheta_{t+2:w-1}(h) \mid \mathcal{F}_{t+1}]\|_p.$$

The proof is completed by induction. □

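Proof of Proposition 6.1. Introduce the regression function

$$\rho_{\mathcal{H},t}(x) = E[\vartheta_{t+1:w}(f,\hat q_{\mathcal{H}}) \mid X_t = x] \tag{6.21}$$

of $\vartheta_{t+1:w}(f,\hat q_{\mathcal{H}})$ and let

$$\bar q_{\mathcal{H},t} = \pi_{\mathcal{H}_t}\rho_{\mathcal{H},t} = \mathrm{pr}_{\mathcal{H}_t}\vartheta_{t+1:w}(f,\hat q_{\mathcal{H}}) \tag{6.22}$$

be its projection onto $\mathcal{H}_t$. By the triangle inequality

$$\|\hat q_{\mathcal{H},t} - q_t\|_2 \le \|\hat q_{\mathcal{H},t} - \bar q_{\mathcal{H},t}\|_2 + \|\bar q_{\mathcal{H},t} - \rho_{\mathcal{H},t}\|_2 + \|\rho_{\mathcal{H},t} - q_t\|_2. \tag{6.23}$$

Again by the triangle inequality and because $\mathcal{H}_t$ is convex so that the projection $\pi_{\mathcal{H}_t}$ from
$L_2(\mathbb{R}^m,\mu_t)$ onto $\mathcal{H}_t$ is distance decreasing

$$\|\bar q_{\mathcal{H},t} - \rho_{\mathcal{H},t}\|_2 = \|\pi_{\mathcal{H}_t}\rho_{\mathcal{H},t} - \rho_{\mathcal{H},t}\|_2 \le \|\pi_{\mathcal{H}_t}\rho_{\mathcal{H},t} - \pi_{\mathcal{H}_t}q_t\|_2 + \|\pi_{\mathcal{H}_t}q_t - q_t\|_2 + \|q_t - \rho_{\mathcal{H},t}\|_2 \le \|\pi_{\mathcal{H}_t}q_t - q_t\|_2 + 2\|q_t - \rho_{\mathcal{H},t}\|_2. \tag{6.24}$$

Inserting (6.24) back into (6.23) gives

$$\|\hat q_{\mathcal{H},t} - q_t\|_2 \le \inf_{h\in\mathcal{H}_t} \|h - q_t\|_2 + \|\hat q_{\mathcal{H},t} - \bar q_{\mathcal{H},t}\|_2 + 3\|\rho_{\mathcal{H},t} - q_t\|_2. \tag{6.25}$$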

By Lemma 6.3

$$\|\hat q_{\mathcal{H},t} - \bar q_{\mathcal{H},t}\|_2 = \|\hat q_{\mathcal{H},t} - \pi_{\mathcal{H}_t}\rho_{\mathcal{H},t}\|_2 \le E[l_t(\hat q_{\mathcal{H}})]^{1/2}. \tag{6.26}$$

For the third term in (6.25), by Proposition 6.4

$$\|\rho_{\mathcal{H},t} - q_t\|_2 = \|E[\vartheta_{t+1:w}(f,\hat q_{\mathcal{H}}) - \vartheta_{t+1:w}(f,q) \mid X_t]\|_2 \le \sum_{s=t+1}^{t+w+1} \|\hat q_{\mathcal{H},s} - q_s\|_2. \tag{6.27}$$

□

6.2. Covering Number Bounds. We define the so called centered loss class

$$\mathcal{L}_t(\mathcal{H}) = \{ l_t(h) \mid h \in \mathcal{H} \}. \tag{6.28}$$

To bound the fluctuations of the sample error $E[l_t(\hat q_{\mathcal{H}})]^{1/2}$ later on in Section 6.3 we require
bounds on the empirical $L_1$-covering numbers $N(\varepsilon, \mathcal{L}_t(\mathcal{H}), d_{1,P_n})$ of the centered loss class.

The first step is to bound the covering numbers of $\mathcal{L}_t(\mathcal{H})$ in terms of the covering numbers of
$\mathcal{H}_t$ and the cash flow class which is defined as

$$\mathcal{G}_t = \{ \vartheta_{t+1:w}(f,h) \mid h \in \mathcal{H} \}. \tag{6.29}$$

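Lemma 6.5. Let $1 \le p < \infty$. If $\mathcal{H}_t$ is uniformly bounded by $H$ and the cash flow class $\mathcal{G}_t$ by
$\Theta$ then for $w > 0$

$$N(8(H+\Theta)\varepsilon, \mathcal{L}_t(\mathcal{H}), d_{p,P_n}) \le N(\varepsilon, \mathcal{H}_t, d_{p,P_n})^2\, N(\varepsilon, \mathcal{G}_t, d_{p,P_n})^2. \tag{6.30}$$

For $w = 0$ the estimate (6.30) simplifies to

$$N(8(H+\Theta)\varepsilon, \mathcal{L}_t(\mathcal{H}), d_{p,P_n}) \le N(\varepsilon, \mathcal{H}_t, d_{p,P_n})^2\, N(\varepsilon, \mathcal{H}_{t+1}, d_{p,P_n})^2. \tag{6.31}$$

Note that if the payoff function $f$ is in $L_\infty(X)$ and the approximation spaces $\mathcal{H}_t$ are uniformly
bounded by $H$ then $\vartheta_{t+1:w}(f,h) \le \Theta = \max(\|f\|_\infty, H)$ and the assumptions of Lemma 6.5 are
satisfied.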



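Proof. We first recall some basic properties of covering numbers. If $\mathcal{F}$ and $\mathcal{G}$ are two classes of
functions and $\mathcal{F} \pm \mathcal{G} = \{f \pm g \mid f \in \mathcal{F}, g \in \mathcal{G}\}$ is the class of formal sums or differences, then for
all $1 \le p < \infty$

$$N(\varepsilon, \mathcal{F} \pm \mathcal{G}, d_{p,P_n}) \le N\big(\tfrac{\varepsilon}{2}, \mathcal{F}, d_{p,P_n}\big)\, N\big(\tfrac{\varepsilon}{2}, \mathcal{G}, d_{p,P_n}\big). \tag{6.32}$$

Furthermore, if $\mathcal{G}$ is a class of functions uniformly bounded by $G$, it follows from
$\|g_1^2 - g_2^2\|_{p,P_n}^p = P_n|(g_1 - g_2)(g_1 + g_2)|^p \le (2G)^p \|g_1 - g_2\|_{p,P_n}^p$ that

$$N(\varepsilon, \mathcal{G}^2, d_{p,P_n}) \le N\big(\tfrac{\varepsilon}{2G}, \mathcal{G}, d_{p,P_n}\big). \tag{6.33}$$

Enlarging a class increases the covering numbers. Now

$$\mathcal{L}_t(\mathcal{H}) \subset (\mathcal{H}_t - \mathcal{G}_t)^2 - (\mathrm{pr}_{\mathcal{H}_t}\mathcal{G}_t - \mathcal{G}_t)^2. \tag{6.34}$$

Because $\mathrm{pr}_{\mathcal{H}_t}\mathcal{G}_t \subset \mathcal{H}_t$, it is sufficient to bound the covering number of the slightly larger class

$$\tilde{\mathcal{L}}_t(\mathcal{H}) = (\mathcal{H}_t - \mathcal{G}_t)^2 - (\mathcal{H}_t - \mathcal{G}_t)^2. \tag{6.35}$$

If $\mathcal{H}_t$ is uniformly bounded by $H < \infty$ and $\vartheta_{t+1:w}(f,h) \le \Theta$, we get from (6.32) and (6.33)

$$N(\varepsilon, \tilde{\mathcal{L}}_t(\mathcal{H}), d_{p,P_n}) \le N\Big(\frac{\varepsilon}{8(H+\Theta)}, \mathcal{H}_t, d_{p,P_n}\Big)^2 N\Big(\frac{\varepsilon}{8(H+\Theta)}, \mathcal{G}_t, d_{p,P_n}\Big)^2. \tag{6.36}$$

For $w = 0$ the Lipschitz bound (6.16) directly leads to

$$N(\varepsilon, \mathcal{G}_t, d_{p,P_n}) \le N(\varepsilon, \mathcal{H}_{t+1}, d_{p,P_n}). \tag{6.37}$$

(6.31) follows directly from (6.36) and (6.37). □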
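
The step from (6.32) and (6.33) to (6.36) can be spelled out as follows. This is a sketch under the assumption that the cash flows are bounded in absolute value by $\Theta$, so that differences $h - g$ are bounded by $H + \Theta$; the shorthand $\mathcal{A} = (\mathcal{H}_t - \mathcal{G}_t)^2$ is introduced here only for this computation, and all covering numbers are taken with respect to $d_{p,P_n}$.

```latex
\begin{align*}
N(\varepsilon, \mathcal{A} - \mathcal{A})
  &\le N\big(\tfrac{\varepsilon}{2}, \mathcal{A}\big)^2
  && \text{by (6.32)}, \\
N\big(\tfrac{\varepsilon}{2}, (\mathcal{H}_t - \mathcal{G}_t)^2\big)
  &\le N\Big(\tfrac{\varepsilon}{4(H+\Theta)}, \mathcal{H}_t - \mathcal{G}_t\Big)
  && \text{by (6.33) with bound } H + \Theta, \\
  &\le N\Big(\tfrac{\varepsilon}{8(H+\Theta)}, \mathcal{H}_t\Big)\,
       N\Big(\tfrac{\varepsilon}{8(H+\Theta)}, \mathcal{G}_t\Big)
  && \text{by (6.32)}.
\end{align*}
```

Squaring the last bound gives the right-hand side of (6.36), and replacing $\varepsilon$ by $8(H+\Theta)\varepsilon$ gives the form stated in (6.30).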

A simple example for which tight covering number bounds exist are subsets of linear vector
spaces. If $\mathcal{H}_t = \{h \in \mathcal{K} \mid \|h\|_\infty \le R\}$ and $\mathcal{K}$ is a linear vector space of dimension $d$ then

$$N(\varepsilon, \mathcal{H}_t, d_{2,P_n}) \le N(\varepsilon, \{h \in \mathcal{K} \mid P_n h^2 \le R^2\}, d_{2,P_n}) \le \Big(\frac{4R + \varepsilon}{\varepsilon}\Big)^d. \tag{6.38}$$

The first inequality in (6.38) is obvious because $\mathcal{H}_t$ is a subset of $\{h \in \mathcal{K} \mid P_n h^2 \le R^2\}$. The
second inequality is standard and can be found for instance in Carl and Stephani (1990) or
Van der Vaart and Wellner (1996).

(6.38) would provide uniform covering number estimates for (6.31) in case of linear
approximation spaces and $w = 0$. We cannot apply (6.38) to upper bound the right hand side of (6.30)
in the general situation $w > 0$ because the cash flow class $\mathcal{G}_t$ is no longer a subset of a linear
space, even if the underlying approximation space $\mathcal{H}_t$ is a finite dimensional linear vector space.
This is where the Vapnik-Chervonenkis theory comes into play.

An important type of function classes for which good uniform estimates on the covering
numbers exist without assuming any linear structure are the so called Vapnik-Chervonenkis classes
or VC-classes, introduced in Vapnik and Chervonenkis (1971) for classes of indicator functions,
i.e., classes of sets. Let $\mathcal{C}$ be a class of subsets of a set $S$. We say that the class $\mathcal{C}$ picks out a
subset $A$ of a set $\sigma_n = \{x_1,\dots,x_n\} \subset S$ of $n$ elements if $A = C \cap \sigma_n$ for some $C \in \mathcal{C}$. The class
$\mathcal{C}$ is said to shatter $\sigma_n$ if each of its $2^n$ subsets can be picked out by $\mathcal{C}$. The VC-dimension of $\mathcal{C}$ is
the largest integer $n$ such that there exists a set of $n$ points which can be shattered by $\mathcal{C}$, i.e.,

$$\mathrm{vc}(\mathcal{C}) = \sup\{n \mid \Delta_n(\mathcal{C}) = 2^n\}, \tag{6.39}$$

where

$$\Delta_n(\mathcal{C}) = \max_{\{x_1,\dots,x_n\}} \mathrm{card}\{C \cap \{x_1,\dots,x_n\} \mid C \in \mathcal{C}\} \tag{6.40}$$

is the so called growth or shattering function. A class $\mathcal{C}$ is called a Vapnik-Chervonenkis or
VC-class if $\mathrm{vc}(\mathcal{C}) < \infty$. A VC-class of dimension $d$ shatters no set of $d + 1$ points. The "richer"
the class $\mathcal{C}$ is, the larger the cardinality of sets which still can be shattered. We illustrate it by a
simple example. The class of left open intervals $\{(-\infty, c] \mid c \in \mathbb{R}\}$ cannot shatter any two-point
set because it cannot pick out the subset consisting only of the larger of the two points, and therefore has VC-dimension one.
By similar reasoning, the class of intervals $\{(a, b] \mid a, b \in \mathbb{R}\}$ shatters two-point sets but fails
to shatter three-point sets: it cannot pick out the subset consisting of the largest and the smallest point of a three-point
set. In contrast, the collection of closed convex subsets of $\mathbb{R}^2$ has infinite VC-dimension: Consider
a set $\sigma_n$ of $n$ points on the unit circle. Every subset $A \subset \sigma_n$ of the $2^n$ subsets can be picked out
by the closed convex hull $\mathrm{co}(A)$ of $A$. A peculiar property of a VC-class is that its shattering
function grows only polynomially in $n$. More precisely, we have the following result
which is due to Sauer, Vapnik-Chervonenkis and Shelah, see Van der Vaart and Wellner (1996),
Corollary 2.6.3, or Dudley (1999).
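
The two interval examples can be verified by brute force on small point sets. The sketch below represents each set by a membership predicate and uses a finite grid of endpoints as a stand-in for all real endpoints; the function names and the grid are illustrative choices, and for richer classes a finite grid would in general only give a lower bound on the VC-dimension.

```python
from itertools import combinations


def picked_out_subsets(points, sets):
    """All subsets of `points` of the form C ∩ points, one per predicate C in `sets`."""
    return {frozenset(x for x in points if contains(x)) for contains in sets}


def is_shattered(points, sets):
    """True if every one of the 2^n subsets of `points` is picked out."""
    return len(picked_out_subsets(points, sets)) == 2 ** len(points)


def vc_dimension(candidates, sets):
    """Largest n such that some n-point subset of `candidates` is shattered."""
    best = 0
    for n in range(1, len(candidates) + 1):
        if any(is_shattered(subset, sets) for subset in combinations(candidates, n)):
            best = n
        else:
            break  # subsets of shattered sets are shattered, so we can stop here
    return best


def half_lines(grid):
    """Half-lines (-inf, c] with threshold c on a finite grid."""
    return [lambda x, c=c: x <= c for c in grid]


def intervals(grid):
    """Intervals (a, b] with endpoints a, b on a finite grid."""
    return [lambda x, a=a, b=b: a < x <= b for a in grid for b in grid]


if __name__ == "__main__":
    grid = [0.5, 1.5, 2.5, 3.5]
    points = [1, 2, 3]
    print(vc_dimension(points, half_lines(grid)))  # half-lines
    print(vc_dimension(points, intervals(grid)))   # intervals
```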

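Lemma 6.6 (Sauer's Lemma). If $\mathcal{C}$ is a VC-class with VC-dimension $d = \mathrm{vc}(\mathcal{C})$, then for $n \ge d$

$$\Delta_n(\mathcal{C}) \le \sum_{i=0}^{d} \binom{n}{i} \le \Big(\frac{en}{d}\Big)^d. \tag{6.41}$$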
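
The polynomial growth asserted by Sauer's lemma can be checked numerically. The sketch below compares the binomial sum with the standard bound $(en/d)^d$, which holds for $n \ge d \ge 1$; the function names are illustrative.

```python
import math


def sauer_sum(n: int, d: int) -> int:
    """Sum of binomial coefficients C(n, i) for i = 0, ..., d."""
    return sum(math.comb(n, i) for i in range(d + 1))


def polynomial_bound(n: int, d: int) -> float:
    """The bound (e n / d)^d, valid for n >= d >= 1."""
    return (math.e * n / d) ** d


if __name__ == "__main__":
    d = 3  # hypothetical VC-dimension
    for n in (3, 10, 100, 1000):
        print(n, sauer_sum(n, d), round(polynomial_bound(n, d), 1), 2 ** n)
```

For fixed $d$ the binomial sum grows like $n^d$, whereas the number $2^n$ of all subsets grows exponentially, which is why large point sets cannot be shattered.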

VC-classes have a variety of permanence properties which allow the construction of new VC- 
classes from basic VC-classes by simple operations such as complements, intersections, unions or 
products. We again refer to Van der Vaart and Wellner (1996, section 2.6.5), or Dudley (1999). 

The concept of VC-classes of sets can be extended to classes of functions in several ways. A
common approach is to associate to a class of functions its subgraph class. More precisely, the
subgraph of a real-valued function $g$ on an arbitrary set $S$ is defined as

$$\mathrm{Gr}(g) = \{(x,t) \in S \times \mathbb{R} \mid t \le g(x)\}. \tag{6.42}$$

A class of real-valued functions $\mathcal{G}$ on $S$ is called a VC-subgraph class, or just VC-class, if its class
of subgraphs is a VC-class and the VC-dimension of $\mathcal{G}$ is defined as

$$\mathrm{vc}(\mathcal{G}) = \mathrm{vc}(\{\mathrm{Gr}(g) \mid g \in \mathcal{G}\}). \tag{6.43}$$

An equivalent definition is obtained by extending the notion of shattering. A class of real-valued
functions $\mathcal{G}$ is said to shatter a set $\{x_1,\dots,x_n\} \subset S$ if there is $r \in \mathbb{R}^n$ such that for every
$b \in \{0,1\}^n$, there is a function $g \in \mathcal{G}$ such that for each $i$, $g(x_i) \ge r_i$ if $b_i = 1$, and $g(x_i) < r_i$ if
$b_i = 0$. The definition

$$\mathrm{vc}(\mathcal{G}) = \sup\{n \mid \exists \{x_1,\dots,x_n\} \subset S \text{ shattered by } \mathcal{G}\} \tag{6.44}$$

agrees with (6.43). For the proof note that a set is shattered by the subgraph class $\{\mathrm{Gr}(g) \mid g \in \mathcal{G}\}$
if and only if it is shattered by the class of indicator functions $\{\theta(g(x) - t) \mid g \in \mathcal{G}\}$, where
$\theta(s) = 1_{\{s \ge 0\}}$. The VC-dimension (6.44) for classes of functions is often called pseudo-dimension,
see Pollard (1990) and Haussler (1995). An alternative generalization is obtained by so called
VC-major classes, originally introduced by Vapnik. For more details on the relation of the two
concepts we refer to Dudley (1999).

Lemma 6.7. Let $\mathcal{G}$ be a finite dimensional real vector space of measurable real-valued functions.
Then, the class of sets $\mathcal{G}^+ = \{\{g \ge 0\} \mid g \in \mathcal{G}\}$ is a VC class with $\mathrm{vc}(\mathcal{G}^+) \le \dim(\mathcal{G})$. If $g_0$ is a
fixed function, then $\mathrm{vc}((g_0 + \mathcal{G})^+) = \mathrm{vc}(\mathcal{G}^+)$. Finally, $\mathcal{G}$ is a VC-class and $\mathrm{vc}(\mathcal{G}) = \dim(\mathcal{G})$.

Proof. For the first two statements we refer to Dudley (1999, theorem 4.2.1), or Van der Vaart
and Wellner (1996, section 2.6). The last statement follows from the first two: Let $g_0(x,t) = -t$
and consider the affine class of functions $g_0 + \mathcal{G}$ on $S \times \mathbb{R}$. Then, the subgraph class of $\mathcal{G}$ is
precisely $(g_0 + \mathcal{G})^+$. □

An important property of VC-classes is that their covering numbers $N(\varepsilon, \mathcal{G}, d_{p,\mu})$ are
polynomial in $\varepsilon^{-1}$ for $\varepsilon \to 0$. More precisely we have the following estimates for the covering numbers
of VC-classes due to Haussler (1995), see also Van der Vaart and Wellner (1996, Theorem 2.6.7).

Lemma 6.8. Let $\mathcal{G} \subset L_p(\mu)$ be a class of functions with an envelope $G \in L_p(\mu)$, i.e., $g \le G$ for
all $g \in \mathcal{G}$. Then,

N(e\\G\\ p ^g,d p ^<e(vc(g) + l)2^ \-j . (6.45) 

After this short digression on VC-theory we continue estimating the empirical $L_1$-covering
numbers of the centered loss class $\mathcal{L}_t(\mathcal{H})$. The next result is fundamental to generalize the
estimate (6.31) to a strictly positive look-ahead parameter $w > 0$. It bounds the VC-dimension
of $\mathcal{G}_t$ in terms of the VC-dimension of the approximation spaces $\mathcal{H}_{t+1},\dots,\mathcal{H}_{t+w+1}$.

Proposition 6.9. Assume that for all $s > t$, $\mathcal{H}_s$ are VC-classes of functions with $\mathrm{vc}(\mathcal{H}_s) \le d$.
Then $\mathcal{G}_t$ is a VC-class with VC-dimension

$$\mathrm{vc}(\mathcal{G}_t) \le c(w)\,d, \tag{6.46}$$

where $c(w) = 2(w+2)\log_2(e(w+2))$.
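
One way to see where the constant $c(w)$ comes from: the proof of this proposition bounds the growth function of the cash flow class by $(en/d)^{d(w+2)}$, and no set of $n$ points can be shattered once this bound falls below $2^n$. The sketch below checks numerically that $n = c(w)\,d$ is large enough for this; the function names are illustrative and the ranges of $w$ and $d$ are arbitrary test values.

```python
import math


def c(w: int) -> float:
    """c(w) = 2 (w + 2) log2(e (w + 2)) from Proposition 6.9."""
    return 2 * (w + 2) * math.log2(math.e * (w + 2))


def growth_bound_log2(n: float, d: int, w: int) -> float:
    """log2 of the growth bound (e n / d)^(d (w + 2))."""
    return d * (w + 2) * math.log2(math.e * n / d)


def shattering_impossible(n: float, d: int, w: int) -> bool:
    """True if (e n / d)^(d (w + 2)) < 2^n, so that no n points are shattered."""
    return growth_bound_log2(n, d, w) < n


if __name__ == "__main__":
    for w in range(4):
        d = 5  # hypothetical VC-dimension of the approximation spaces
        n0 = math.ceil(c(w) * d)
        print(w, n0, shattering_impossible(n0, d, w))
```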

Inequalities (6.30), (6.31), (6.45), and (6.46) finally lead to explicit uniform bounds for the
empirical $L_1$-covering numbers of the centered loss class $\mathcal{L}_t(\mathcal{H})$.

Corollary 6.10. Assume that all $\mathcal{H}_s$ are classes of functions uniformly bounded by $H$ and with
bounded VC-dimension $\mathrm{vc}(\mathcal{H}_s) \le d$. If the cash flow function satisfies $\vartheta_{t+1:w}(f,h) \le H$, then

N{e,C t {H),d hPn ) < (6.47) 
e 4 (d+ l) 2 (c(w)d + l) 2 f — - — J , for w>l, 

e 4 (d+l) 4 ^j 4d , forw = 0. 






DANIEL EGLOFF 



Optimal stopping is a particular stochastic control problem with a simple control space. The
proof of Proposition 6.9 relies on the observation that the VC-dimension of the class of indicator
functions $\mathcal{C}_s = \{\theta_{f,s}(h) \mid h_s \in \mathcal{H}_s\}$, which appear in the definition of $\tau_t(h)$ and $\vartheta_{t+1:w}(h)$, is
bounded by $\mathrm{vc}(\mathcal{H}_s)$. It is an interesting question how Proposition 6.9 can be extended to more
general stochastic control problems.

Before we proceed to the proof of Proposition 6.9 we add a remark on VC-classes and their
VC-dimension. Let $\mathcal{A}$ be a class of sets. The class of indicator functions $\{1_A \mid A \in \mathcal{A}\}$ is a
VC-class in the sense that its subgraph class is a VC-class if and only if $\mathcal{A}$ is a VC-class and
$\mathrm{vc}(\mathcal{A}) = \mathrm{vc}(\{1_A \mid A \in \mathcal{A}\})$. Let $\theta(x) = 1_{\{x \ge 0\}}$. If $\mathcal{A}$ is a VC-class, $\mathrm{vc}(\mathcal{A}) = d$, then by Sauer's
Lemma 6.6, for $x_1,\dots,x_n$ and all $t \in \mathbb{R}^n$

$$\mathrm{card}\{(\theta(1_A(x_i) - t_i))_{i=1,\dots,n} \mid A \in \mathcal{A}\} \le \Big(\frac{en}{d}\Big)^d. \tag{6.48}$$

Conversely, if we find a polynomial bound like (6.48), $\mathcal{A}$ must be a VC-class and we can bound
its VC-dimension.

To prove Proposition 6.9 we first establish the following general result on VC-classes. 

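Lemma 6.11. Let $\mathcal{X}$, $\mathcal{Y}$ be two sets and $\mathcal{A}$, $\mathcal{B}$ VC-classes of subsets of $\mathcal{X}$ (respectively, $\mathcal{Y}$).
Assume that $\mathrm{vc}(\mathcal{A}) \le d$, $\mathrm{vc}(\mathcal{B}) \le d$. Let $f : \mathcal{X} \to \mathbb{R}$ and $g : \mathcal{Y} \to \mathbb{R}$ be non-negative functions.
Define the class of functions

$$\mathcal{F}(\mathcal{A},\mathcal{B}) = \{F_{A,B}(x,y) = 1_A(x)f(x) + 1_{A^c}(x)1_B(y)g(y) \mid A \in \mathcal{A}, B \in \mathcal{B}\}. \tag{6.49}$$

Then $\mathcal{F}(\mathcal{A},\mathcal{B})$ is a VC-subgraph class, its growth function is bounded by

$$\Delta_n(\mathcal{F}(\mathcal{A},\mathcal{B})^+) \le \Big(\frac{en}{d}\Big)^{2d}, \tag{6.50}$$

and

$$\mathrm{vc}(\mathcal{F}(\mathcal{A},\mathcal{B})) \le 2d\log_2(e). \tag{6.51}$$

The estimates (6.50) and (6.51) generalize to

$$\mathcal{F}(\mathcal{A},\mathcal{H}) = \{F_{A,h}(x,y) = 1_A(x)f(x) + 1_{A^c}(x)h(y) \mid A \in \mathcal{A}, h \in \mathcal{H}\} \tag{6.52}$$

where $\mathcal{H}$ is a VC-class of functions with $\mathrm{vc}(\mathcal{H}) = \mathrm{vc}(\mathcal{H}^+) \le d$.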

Proof. Given points $(x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$ and $t_i \in \mathbb{R}$, $i = 1,\dots,n$, we need to bound the cardinality
of

$$\{(\theta(F_{A,B}(x_i,y_i) - t_i))_{i=1,\dots,n} \mid A \in \mathcal{A}, B \in \mathcal{B}\}, \tag{6.53}$$

as a subset of the binary cube $\{0,1\}^n$. Because

$$F_{A,B}(x_i,y_i) = 1_B(y_i)\big(g(y_i) - 1_A(x_i)g(y_i)\big) + 1_A(x_i)f(x_i),$$

and $(g(y_i) - 1_A(x_i)g(y_i)) \ge 0$ we find that

$$\theta(F_{A,B}(x_i,y_i) - t_i) = \begin{cases} \theta(1_B(y_i) - \tau_i(A)) & \text{on } S_+(A) = \{(x_j,y_j) \mid 1_{A^c}(x_j)g(y_j) > 0\}, \\ \theta(1_A(x_i)f(x_i) - t_i) & \text{on } S_0(A) = \{(x_j,y_j) \mid 1_{A^c}(x_j)g(y_j) = 0\}, \end{cases} \tag{6.54}$$

where

$$\tau_i(A) = \frac{t_i - 1_A(x_i)f(x_i)}{g(y_i) - 1_A(x_i)g(y_i)}. \tag{6.55}$$

Fix $A$ and vary $B$ over $\mathcal{B}$. Because $\mathrm{vc}(\mathcal{B}) \le d$ we see from (6.54) and Sauer's lemma, that the
binary set

$$\{(\theta(F_{A,B}(x_i,y_i) - t_i))_{i=1,\dots,n} \mid B \in \mathcal{B}\} \tag{6.56}$$

has cardinality $K$ bounded above by $(en d^{-1})^d$. Let

$$b_1(A),\dots,b_K(A) \tag{6.57}$$

enumerate the distinct elements of (6.56) generated by sets $B_k$. For $(x_i,y_i) \in S_0(A)$ we have

$$b_{k,i}(A) = \theta(1_A(x_i)f(x_i) - t_i), \tag{6.58}$$

and if $(x_i,y_i) \in S_+(A)$

$$b_{k,i}(A) = \theta(1_{B_k}(y_i) - \tau_i(A)) = \begin{cases} \theta(1_A(x_i) - \tau_i(B_k)) & \text{on } S_+(B_k) = \{(x_j,y_j) \mid f(x_j) - 1_{B_k}(y_j)g(y_j) > 0\}, \\ 1 - \theta(1_A(x_i) - \tau_i(B_k)) & \text{on } S_-(B_k) = \{(x_j,y_j) \mid f(x_j) - 1_{B_k}(y_j)g(y_j) < 0\}, \\ \theta(1_{B_k}(y_i)g(y_i) - t_i) & \text{on } S_0(B_k) = \{(x_j,y_j) \mid f(x_j) - 1_{B_k}(y_j)g(y_j) = 0\}. \end{cases} \tag{6.59}$$

Consequently Sauer's lemma again implies that for each fixed $k$ the binary set

$$\{b_k(A) \mid A \in \mathcal{A}\} \tag{6.60}$$

has cardinality at most $(en d^{-1})^d$. This proves (6.50). Again by Sauer's lemma, every $n_0 > 0$ such
that

$$\mathrm{card}\{(\theta(F_{A,B}(x_i,y_i) - t_i))_{i=1,\dots,n} \mid A \in \mathcal{A}, B \in \mathcal{B}\} \le \Big(\frac{en}{d}\Big)^{2d} < 2^n, \tag{6.61}$$

for all $n \ge n_0$ is an upper bound of $\mathrm{vc}(\mathcal{F}(\mathcal{A},\mathcal{B})^+)$. To find $n_0$, we look for solutions $n_0 = dj$ that
are multiples of $d$. (6.61) leads to the condition

$$\log_2(ej) < j,$$

which is satisfied for example by $j = 2\log_2(e)$. The extension to $\mathcal{F}(\mathcal{A},\mathcal{H})$ is straightforward.
Replace $\theta(1_B(y_i) - \tau_i(A))$ in (6.54) by $\theta(h(y_i) - \tau_i(A))$, where $\tau_i(A) = (t_i - f(x_i))/1_{A^c}(x_i)$ and
follow the same lines of reasoning. □

Proof of Proposition 6.9. Recall definition (3.16) of the cash flow function, according to which

$$\vartheta_{t+1:w}(f,h) = \theta_{f,t+1}(h)f_{t+1} + \dots + \theta_{f,t+w+1}(h)\prod_{r=t+1}^{t+w}\theta^-_{f,r}(h)\,f_{t+w+1} + \prod_{r=t+1}^{t+w+1}\theta^-_{f,r}(h)\,h_{t+w+1}. \tag{6.62}$$

Because the classes of indicator functions

$$\mathcal{C}_s = \{\theta_{f,s}(h) = 1_{\{f_s - h_s \ge 0\}} \mid h_s \in \mathcal{H}_s\}, \qquad \mathcal{C}^-_s = \{\theta^-_{f,s}(h) = 1_{\{f_s - h_s < 0\}} \mid h_s \in \mathcal{H}_s\}, \tag{6.63}$$

are VC classes with VC-dimension

$$\mathrm{vc}(\mathcal{C}^-_s) = \mathrm{vc}(\mathcal{C}_s) = \mathrm{vc}((f_s - \mathcal{H}_s)^+) = \mathrm{vc}(\mathcal{H}_s^+) = \mathrm{vc}(\mathcal{H}_s) \le d, \tag{6.64}$$

we can recursively apply Lemma 6.11 to derive the bound

$$\mathrm{card}\{(\theta(\vartheta_{t+1:w}(f,h)(x_i) - t_i))_{i=1,\dots,n} \mid h \in \mathcal{H}\} \le \Big(\frac{en}{d}\Big)^{d(w+2)}. \tag{6.65}$$

The VC-dimension of $\mathcal{G}_t$ is then estimated as in the proof of Lemma 6.11. This completes the
proof of Proposition 6.9. □

6.3. Proof of Theorem 5.1 and Theorem 5.3. The centered loss $l_t(\hat q_{\mathcal{H}})$ depends on the
sample $D_n$. To control the fluctuations of the random variable $E[l_t(\hat q_{\mathcal{H}})]$ we need uniform estimates
over the whole centered loss class $\mathcal{L}_t(\mathcal{H})$. The usual procedure is to apply exponential deviation
inequalities for the empirical process

$$\{\sqrt{n}\,(E[l] - P_n l) \mid l \in \mathcal{L}_t(\mathcal{H})\} \tag{6.66}$$

indexed by $\mathcal{L}_t(\mathcal{H})$, which are closely related to the Uniform Law of Large Numbers. For background
we refer to Pollard (1984), Van der Vaart and Wellner (1996), Talagrand (1994), and Györfi et al.
(2002).

The application of standard deviation inequalities to the whole centered loss class Ct (7i) is not 
efficient since the empirical minimizer is close to the actual L2-minimizer with high probability. 
Therefore, the random element lt{qn) is with high probability in a small subset of £t(7i). To get 
sharper estimates, the empirical process needs to be localized such that more weight is assigned to 
these loss functions. Lee, Bartlett and Williamson (1996) proved the following localized deviation 
inequality. 

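Theorem 6.12 (Lee et al. (1996), Theorem 6). Let $\mathcal{L}$ be a class of functions such that $|l| \le K_1$,
$E[l] \ge 0$, and for some $K_2 \ge 1$,

$$E[l^2] \le K_2 E[l] \quad \forall\, l \in \mathcal{L}. \tag{6.67}$$

Let $a, b > 0$ and $0 < \delta \le 1/2$. Then, for all

$$n \ge \max\left\{ \frac{4(K_1 + K_2)}{\delta^2 (a+b)},\ \frac{K_1^2}{\delta^2 (a+b)} \right\}, \tag{6.68}$$

$$P\left( \sup_{l \in \mathcal{L}} \frac{E[l] - P_n(l)}{E[l] + a + b} \ge \delta \right) \le 4 \sup_{x_1,\dots,x_{2n}} N\left( \frac{\delta b}{4K_1}, \mathcal{L}, d_{1,P_{2n}} \right) \exp\left( -\frac{3\delta^2 a n}{4K_1 + 162K_2} \right), \tag{6.69}$$

where $P_{2n}$ is the empirical measure supported at $(x_1, \dots, x_{2n})$.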



A similar bound has been obtained by Cucker and Smale (2001, Proposition 7) for $L_\infty$-covering
numbers. Theorem 6.12 has been improved in Kohler (2000) by applying chaining
techniques, and in Bartlett, Bousquet and Mendelson (2002) by using concentration properties of
local Rademacher averages. For additional background on related bounds we refer to Talagrand
(1994), Ledoux (1996), Massart (2000), and Rio (2001). The advantage of Theorem 6.12 over
Pollard's deviation inequality is that it improves the quadratic dependence on
$\varepsilon$ in standard deviation inequalities to a linear dependence.

The centered loss has a special structure which allows us to bound its variance in terms of its
expectation.

Lemma 6.13. Let $\mathcal{H}_t$ be convex, uniformly bounded by $H < \infty$, and assume that $\vartheta_{t+1:w}(f,h) \le \Theta$
for some constant $\Theta < \infty$. Then the centered loss class $\mathcal{L}_t(\mathcal{H})$ is uniformly bounded and for
all $l \in \mathcal{L}_t(\mathcal{H})$

$$|l| \le 4H(\Theta + H), \qquad E[l^2] \le 4(\Theta + H)^2 E[l]. \tag{6.70}$$

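Proof. We get from the definition (6.3) of $l_t(h)$ that

$$\begin{aligned} l_t(h) &= (h_t - \mathrm{pr}_{\mathcal{H}_t}\vartheta_{t+1:w}(f,h))\,(h_t + \mathrm{pr}_{\mathcal{H}_t}\vartheta_{t+1:w}(f,h) - 2\vartheta_{t+1:w}(f,h)) \\ &\le 2(\Theta + H)\,|h_t - \mathrm{pr}_{\mathcal{H}_t}\vartheta_{t+1:w}(f,h)|. \end{aligned} \tag{6.71}$$

Therefore,

$$E[l_t(h)^2] \le 4(\Theta + H)^2 E[|h_t - \mathrm{pr}_{\mathcal{H}_t}\vartheta_{t+1:w}(f,h)|^2] \le 4(\Theta + H)^2 E[l_t(h)],$$

where the last step follows from Lemma 6.3. □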

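Our plan is to apply Theorem 6.12 to a suitably scaled loss class

$$\lambda\mathcal{L}_t(\mathcal{H}) = \{ \lambda l \mid l \in \mathcal{L}_t(\mathcal{H}) \}, \tag{6.72}$$

where we choose $\lambda$ such that $|\lambda l| \le 1$ (the scaling gives a term $\beta^2$ in the consistency condition
(5.3) instead of $\beta^4$). Because an empirical risk minimizer satisfies $P_n(l_t(q_{\mathcal{H}})) \le 0$, it follows that
for any $\varepsilon > 0$ and scaling factor $\lambda > 0$

$$\begin{aligned} P(E[l_t(q_{\mathcal{H}})] > \varepsilon) &\le P(E[l_t(q_{\mathcal{H}})] \ge 2P_n(l_t(q_{\mathcal{H}})) + \varepsilon) \\ &\le P\left( \sup_{l \in \lambda\mathcal{L}_t(\mathcal{H})} \frac{E[l] - P_n(l)}{E[l] + \lambda\varepsilon} \ge \frac{1}{2} \right). \end{aligned} \tag{6.73}$$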

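Assume that the conditions of Lemma 6.13 are satisfied and set $\beta = \max(\Theta, H)$. If we choose
the scaling factor $\lambda = 1/(8\beta^2)$, the scaled class $\lambda\mathcal{L}_t(\mathcal{H})$ satisfies

$$|\lambda l| \le 1, \qquad E[(\lambda l)^2] \le 2E[\lambda l]. \tag{6.74}$$

Theorem 6.12 applied with $\mathcal{L} = \lambda\mathcal{L}_t(\mathcal{H})$, $K_1 = 1$, $K_2 = 2$, $a = b = \varepsilon/(16\beta^2)$, $\delta = 1/2$ implies

$$P(E[l_t(q_{\mathcal{H}})] > \varepsilon) \le 4 \sup_{x_1,\dots,x_{2n}} N\left( \frac{\varepsilon}{128\beta^2}, \lambda\mathcal{L}_t(\mathcal{H}), d_{1,P_{2n}} \right) \exp\left( -\frac{n\varepsilon}{6998\beta^2} \right), \tag{6.75}$$

for $n \ge 384\beta^2/\varepsilon$. The $\varepsilon$-covering number of $\lambda\mathcal{L}_t(\mathcal{H})$ is the same as the $(\lambda^{-1}\varepsilon)$-covering number
of the unscaled class $\mathcal{L}_t(\mathcal{H})$. If the VC-dimensions of $\mathcal{H}_s$, $s \ge t$, are bounded by $d$, the covering
number bound (6.47) shows that

$$P(E[l_t(q_{\mathcal{H}})] > \varepsilon) \le K \left( \frac{1}{\varepsilon} \right)^{v} \exp\left( -\frac{n\varepsilon}{6998\beta^2} \right), \tag{6.76}$$

where

$$v = v(w,d) = 2d(c(w) + 1), \tag{6.77}$$

and

$$K = K(d,w,\beta) = 6e^4 (d+1)^2 (c(w)d+1)^2 (1024 e \beta)^{v(w,d)}. \tag{6.78}$$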
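
The constants entering (6.74) and the exponent of (6.75) can be checked with exact rational arithmetic. The following is a minimal sketch, not part of the original argument. It assumes the bounds $|l| \le 4H(\Theta+H)$ and $E[l^2] \le 4(\Theta+H)^2E[l]$ of Lemma 6.13, and it assumes that the exponent of Theorem 6.12 has the form $3\delta^2 a n/(4K_1 + 162K_2)$. The function names `scaled_constants` and `exponent_constant` are hypothetical.

```python
from fractions import Fraction
from math import ceil

def scaled_constants(theta, H):
    """Return (lambda, K1, K2) for the scaled class lambda * L_t(H).

    Assumes the bounds of Lemma 6.13: |l| <= 4H(theta + H) and
    E[l^2] <= 4(theta + H)^2 E[l].
    """
    beta = max(theta, H)
    lam = 1 / (8 * beta**2)           # scaling factor lambda = 1/(8 beta^2)
    K1 = lam * 4 * H * (theta + H)    # uniform bound on |lambda l|
    K2 = lam * 4 * (theta + H)**2     # E[(lambda l)^2] <= K2 * E[lambda l]
    return lam, K1, K2

def exponent_constant(K1=Fraction(1), K2=Fraction(2), delta=Fraction(1, 2)):
    """Constant c for which the exponent reads n * eps / (c * beta^2).

    Assumed exponent form: 3 delta^2 a n / (4 K1 + 162 K2), a = eps/(16 beta^2).
    """
    return (4 * K1 + 162 * K2) * 16 / (3 * delta**2)

# Worst case theta = H = beta: the scaled bounds are attained with equality.
lam, K1, K2 = scaled_constants(Fraction(3), Fraction(3))
c = exponent_constant()
print(K1, K2, ceil(c))
```

Under these assumptions the worst case $\Theta = H = \beta$ gives exactly $K_1 = 1$ and $K_2 = 2$, and the exponent constant $20992/3$ rounds up to the value 6998 used in the text.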

Proof of Theorem 5.1. Let $\beta_n$ be a sequence of truncation thresholds tending to infinity. If $q_{\beta_n,t}$ is the
continuation value for the truncated payoff $T_{\beta_n}f$, we get from (5.6) that $\|q_t - q_{\beta_n,t}\|_2 \to 0$. The
error decomposition (6.4) separates the approximation error and the sample error. The denseness
assumption implies that the approximation error $\inf_{h \in \mathcal{H}_{n,t}} \|h - q_{\beta_n,t}\|_2$ tends to zero as $n \to \infty$.
It remains to analyze the sample error $E[l_t(q_{\mathcal{H}_n})]$ for the underlying payoff $T_{\beta_n}f$.

We apply (6.76) to $\mathcal{H} = \mathcal{H}_n$, for which $d = d_n$ and $\beta = \beta_n$. There exists a constant $C(\varepsilon,w)$
such that for every fixed $\varepsilon > 0$

$$P(E[l_t(q_{\mathcal{H}_n})] > \varepsilon) \le C(\varepsilon,w) \exp\left( d_n \log(\beta_n) - \frac{n\varepsilon}{6998\beta_n^2} \right). \tag{6.79}$$

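The right hand side converges to zero for every fixed $\varepsilon > 0$ if $n/\beta_n^2$ diverges to infinity faster
than $d_n \log(\beta_n)$, that is, if $d_n \beta_n^2 \log(\beta_n) n^{-1} \to 0$. Convergence in probability follows from (6.4) by
induction. Convergence in $L_2(P)$ is shown by evaluating

$$E[E[l_t(q_{\mathcal{H}_n})]] \le \varepsilon + \int_\varepsilon^\infty P(E[l_t(q_{\mathcal{H}_n})] > u)\, du, \tag{6.80}$$

using the estimate (6.79). Conditions (5.3) and (5.5) imply

$$\begin{aligned} \sum_{n=1}^\infty P(E[l_t(q_{\mathcal{H}_n})] > \varepsilon) &\le C(\varepsilon,w) \sum_{n=1}^\infty \exp\left( d_n \log(\beta_n) - \frac{n\varepsilon}{6998\beta_n^2} \right) \\ &= C(\varepsilon,w) \sum_{n=1}^\infty n^{-\frac{n}{\beta_n^2 \log(n)} \left( \frac{\varepsilon}{6998} - \frac{d_n \beta_n^2 \log(\beta_n)}{n} \right)} < \infty. \end{aligned} \tag{6.81}$$

Almost sure convergence follows from the Borel-Cantelli Lemma. □
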
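Proof of Theorem 5.3. Integrating (6.76) over $\varepsilon$ shows that for any $\kappa > 0$ large enough that (6.76) is
valid for all $\varepsilon \ge \kappa$

$$\begin{aligned} E[E[l_t(q_{\mathcal{H}})]] &= \int_0^\infty P(E[l_t(q_{\mathcal{H}})] > \varepsilon)\, d\varepsilon \\ &\le \kappa + K\kappa^{-v} \int_\kappa^\infty \exp\left( -\frac{n\varepsilon}{6998\beta^2} \right) d\varepsilon \\ &\le \kappa + K\kappa^{-v} n^{-1}\, 6998\beta^2 \exp\left( -\frac{n\kappa}{6998\beta^2} \right). \end{aligned} \tag{6.82}$$

Setting

$$\kappa = \frac{6998\beta^2}{n} \log(6998 K \beta^2 n^v) \tag{6.83}$$

leads to the upper bound

$$E[E[l_t(q_{\mathcal{H}})]] \le 6998\beta^2\, \frac{1 + \log(6998 K \beta^2) + v\log(n)}{n}. \tag{6.84}$$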

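Corollary 6.2 implies that

$$E[\|q_{\mathcal{H},t} - q_t\|_2^2] \le 2 \cdot 16^{w+1} \left( \max_{s=t,\dots,t+w+1} \inf_{h \in \mathcal{H}_s} \|h - q_s\|_2^2 + E\Big[ \max_{s=t,\dots,t+w+1} E[l_s(q_{\mathcal{H}})] \Big] \right).$$

But

$$E\Big[ \max_{s=t,\dots,t+w+1} E[l_s(q_{\mathcal{H}})] \Big] \le (w+2) \max_{s=t,\dots,t+w+1} E[E[l_s(q_{\mathcal{H}})]].$$

Apply (6.84) to complete the proof. □
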
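Proof of Corollary 5.4. Estimate (6.76) implies

$$P(E[l_t(q_{\mathcal{H}})] > \varepsilon) \le K \exp\left( -\frac{n\varepsilon}{13996\beta^2} \right) \exp\left( -\frac{n\varepsilon}{13996\beta^2} - v\log(\varepsilon) \right). \tag{6.85}$$

By straightforward calculations, the right hand side is smaller than $\delta$ for all $n$ satisfying

$$n \ge 13996\beta^2 \max\left( \frac{1}{\varepsilon} \log\left( \frac{K}{\delta} \right),\ \frac{v}{\varepsilon} \log\left( \frac{1}{\varepsilon} \right) \right). \tag{6.86}$$

The sample complexity bound (5.16) follows from Corollary 6.2 and (6.85), (6.86) with $\varepsilon$ replaced by
$\varepsilon/(32(w+2)16^w)$.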
□ 
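
The step from the tail bound to the sample size can be illustrated numerically. The sketch below assumes the tail bound has the form $K\varepsilon^{-v}\exp(-n\varepsilon/(6998\beta^2))$ and evaluates the threshold $13996\beta^2\max(\varepsilon^{-1}\log(K/\delta),\ v\varepsilon^{-1}\log(1/\varepsilon))$. The function names and the parameter values are hypothetical and chosen only for illustration. The constant $K$ is passed through its logarithm because it can be very large.

```python
import math

def log_tail_bound(n, eps, beta, v, log_K):
    """Logarithm of K * eps**(-v) * exp(-n * eps / (6998 * beta**2))."""
    return log_K - v * math.log(eps) - n * eps / (6998.0 * beta**2)

def sample_size(eps, delta, beta, v, log_K):
    """Smallest integer n above the threshold 13996 * beta^2 * max(...)."""
    term_conf = (log_K - math.log(delta)) / eps   # (1/eps) * log(K/delta)
    term_dim = v * math.log(1.0 / eps) / eps      # (v/eps) * log(1/eps)
    return math.ceil(13996.0 * beta**2 * max(term_conf, term_dim))

# Illustrative values only; none of them come from the paper.
eps, delta, beta, v, log_K = 0.1, 0.05, 1.0, 8, 20.0
n = sample_size(eps, delta, beta, v, log_K)
print(n, log_tail_bound(n, eps, beta, v, log_K) <= math.log(delta))
```

The check works because twice the maximum of two numbers dominates their sum: at the threshold, $n\varepsilon/(6998\beta^2) \ge \log(K/\delta) + v\log(1/\varepsilon)$, which is exactly what is needed for the tail bound to fall below $\delta$.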
6.4. Proof of Corollary 5.5. Because $q_t \in W^k(L_\infty(I, \lambda))$, Jackson type estimates imply that
for every $r \ge k$ there exists a polynomial $p_r \in \mathcal{P}_r$ such that

$$\|p_r - q_t\|_{\infty,I,\lambda} \le C_I\, r^{-k}\, \|q_t\|_{\infty,k,I,\lambda}. \tag{6.87}$$

The constant $C_I$ only depends on $I$ but not on $r$ or $q_t$. See for instance DeVore and Lorentz
(1993, Theorem 6.2, Chapter 7). Consequently

$$\|p_r\|_{\infty,I,\lambda} \le \|p_r - q_t\|_{\infty,I,\lambda} + \|q_t\|_{\infty,I,\lambda} \le 2\|q_t\|_{\infty,k,I,\lambda} \tag{6.88}$$

for $r$ sufficiently large. We therefore may restrict the minimization to the convex, uniformly
bounded set of functions $\mathcal{H}_{n,t}$ as defined in (5.17). The VC-dimension of $\mathcal{H}_{n,t}$ is bounded by
$n^{m/(m+2k)}$. Theorem 5.3 applies. Because $X_t$ is localized to $I$, the approximation error in (5.13)
is bounded by

$$\inf_{h \in \mathcal{H}_{n,t}} \|h - q_t\|_2^2 \le \inf_{p \in \mathcal{H}_{n,t}} \|p - q_t\|_{\infty,I,\lambda}^2 \le C_I\, n^{-2k/(2k+m)}\, \|q_t\|_{\infty,k,I,\lambda}^2. \tag{6.89}$$

Inserting $\mathrm{vc}(\mathcal{H}_{n,t}) \le n^{m/(m+2k)}$ into (5.13) shows that the sample error is of the order

$$O\left( \log(n)\, n^{-2k/(2k+m)} \right).$$

The extension to $\mu_t$ with bounded density with respect to Lebesgue measure is proved identically.

□ 
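
The choice of VC-dimension $n^{m/(m+2k)}$ balances the two error terms up to a logarithmic factor. The sketch below makes this arithmetic explicit, assuming the sample error scales like $d\log(n)/n$ in the VC-dimension $d$. The function name `error_rates` and the sample values are hypothetical.

```python
import math

def error_rates(n, m, k):
    """Orders of the two error terms when the VC-dimension is n**(m/(m+2k)).

    Constants are ignored; only the dependence on n is tracked.
    """
    vc_dim = n ** (m / (m + 2 * k))
    approx = n ** (-2 * k / (2 * k + m))   # approximation error order
    sample = math.log(n) * vc_dim / n      # assumed sample error order d*log(n)/n
    return vc_dim, approx, sample

# Illustrative values: n samples, state dimension m, smoothness k.
vc_dim, approx, sample = error_rates(n=10_000, m=2, k=3)
print(vc_dim, approx, sample, sample / approx)
```

Since $n^{m/(m+2k)}/n = n^{-2k/(2k+m)}$, the ratio of the sample error order to the approximation error order equals $\log(n)$.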



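6.5. Proof of Proposition 5.2. Note that

$$|\max(a,x) - \max(a,y)| \le |x - y|. \tag{6.90}$$

The representation of the continuation value in terms of the transition functions gives

$$\begin{aligned} \|q_t - q_{\beta,t}\|_p &= \|E[\max(f_{t+1}, q_{t+1}) - \max(T_\beta f_{t+1}, q_{\beta,t+1}) \mid X_t]\|_p \\ &\le \|E[f_{t+1} - T_\beta f_{t+1} \mid X_t]\|_p + \|E[q_{t+1} - q_{\beta,t+1} \mid X_t]\|_p \\ &\le \|(f_{t+1} - \beta) 1_{\{f_{t+1} > \beta\}}\|_p + \|q_{t+1} - q_{\beta,t+1}\|_p. \end{aligned}$$

If $f \in L^p(X)$, then $\|(f_{t+1} - \beta) 1_{\{f_{t+1} > \beta\}}\|_p \to 0$ for $\beta \to \infty$. We first recall that for a nonnegative
random variable $Y$ and $r \ge 1$

$$E[Y^r] = r \int_0^\infty y^{r-1} P(Y > y)\, dy. \tag{6.91}$$

Then (5.7) follows from

$$\begin{aligned} \|(f_{t+1} - \beta) 1_{\{f_{t+1} > \beta\}}\|_r^r &= r \int_0^\infty u^{r-1} P((f_{t+1} - \beta) 1_{\{f_{t+1} > \beta\}} > u)\, du \\ &= r \int_\beta^\infty (u - \beta)^{r-1} P(f_{t+1} > u)\, du \\ &\le r \int_\beta^\infty u^{r-1} P(f_{t+1}^p > u^p)\, du \\ &\le \frac{r}{p - r} E[f_{t+1}^p]\, \beta^{r-p} \le O(\beta^{r-p}), \end{aligned}$$

where we have used Markov's inequality to get to the last line. □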
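
The final estimate can be sanity-checked on a distribution for which both sides are available in closed form. The sketch below is an illustration, not part of the proof. It assumes a Pareto-type payoff with $P(f > u) = u^{-\alpha}$ for $u \ge 1$, for which $E[f^p] = \alpha/(\alpha - p)$ when $p < \alpha$, and for which the tail integral formula reduces to a Beta function via the substitution $u = \beta/s$. The function names and the parameter values are hypothetical.

```python
import math

def truncated_moment(beta, r, alpha):
    """E[((f - beta)^+)^r] for P(f > u) = u**(-alpha), u >= 1, beta >= 1, r < alpha.

    Uses r * int_beta^inf (u - beta)^(r-1) u^(-alpha) du
       = r * beta^(r - alpha) * B(alpha - r, r).
    """
    beta_fn = math.gamma(alpha - r) * math.gamma(r) / math.gamma(alpha)
    return r * beta ** (r - alpha) * beta_fn

def markov_bound(beta, r, p, alpha):
    """Bound r/(p - r) * E[f^p] * beta^(r - p), valid for r < p < alpha."""
    moment_p = alpha / (alpha - p)   # E[f^p] for the assumed Pareto-type tail
    return r / (p - r) * moment_p * beta ** (r - p)

# Illustrative parameters: tail index alpha, moment orders r < p < alpha.
alpha, p, r = 5.0, 4.0, 2.0
for beta in (2.0, 5.0, 10.0):
    print(beta, truncated_moment(beta, r, alpha), markov_bound(beta, r, p, alpha))
```

For this example the exact truncated moment decays like $\beta^{r-\alpha}$, faster than the bound of order $\beta^{r-p}$, which is consistent with the inequality being an upper estimate obtained from Markov's inequality.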